Abstract



We consider how to optimize memory use and computation time in oper- 
ating a quantum computer. In particular, we estimate the number of memory 
qubits and the number of operations required to perform factorization, using 
the algorithm suggested by Shor. A K-bit number can be factored in time of
order $K^3$ using a machine capable of storing 5K + 1 qubits. Evaluation of the
modular exponential function (the bottleneck of Shor's algorithm) could be
achieved with about $72K^3$ elementary quantum gates; implementation using
a linear ion trap would require about $396K^3$ laser pulses. A proof-of-principle
demonstration of quantum factoring (factorization of 15) could be performed 
with only 6 trapped ions and 38 laser pulses. Though the ion trap may never 
be a useful computer, it will be a powerful device for exploring experimentally 
the properties of entangled quantum states.

I. INTRODUCTION AND SUMMARY



Recently, Shor has exhibited a probabilistic algorithm that enables a quantum computer
to find a nontrivial factor of a large composite number N in a time bounded from
above by a polynomial in log N. As it is widely believed that no polynomial-time factorization
algorithm exists for a classical Turing machine, Shor's result indicates that a quantum
computer can efficiently perform interesting computations that are intractable on a classical
computer, as had been anticipated by Feynman, Deutsch, and others.

Furthermore, Cirac and Zoller have suggested an ingenious scheme for performing
quantum computation using a potentially realizable device. The machine they envisage is
an array of cold ions confined in a linear trap, and interacting with laser beams. Such linear
ion traps have in fact been built, and these devices are remarkably well protected from
the debilitating effects of decoherence. Thus, the Cirac-Zoller proposal has encouraged
speculation that a proof-of-principle demonstration of quantum factoring might be performed in
the reasonably near future.

Spurred by these developments, we have studied the computational resources that are 
needed to carry out the factorization algorithm using the linear ion trap computer or a 
comparable device. Of particular interest is the inevitable tension between two competing 
requirements. Because of practical limitations on the number of ions that can be stored 
in the trap, there is a strong incentive to minimize the number of qubits in the device by 
managing memory resources frugally. On the other hand, the device has a characteristic 
decoherence time scale, and the computation will surely crash if it takes much longer than
the decoherence time. For this reason, and because optimizing speed is desirable anyway, 
there is a strong incentive to minimize the total number of elementary operations that must 
be completed during the computation. A potential rub is that frugal memory management 
may result in longer computation time. 

One of our main conclusions, however, is that substantial squeezing of the needed memory 
space can be achieved without sacrificing much in speed. A quantum computer capable of
storing 5K + 1 qubits can run Shor's algorithm to factor a K-bit number N in a time of
order $K^3$. Faster implementations of the algorithm are possible for asymptotically large K,
but these require more qubits, and are relatively inefficient for values of K that are likely to
be of practical interest. For these values of K, a device with unlimited memory using our
algorithms would be able to run only a little better than twice as fast as a device that
stores 5K + 1 qubits. Further squeezing of the memory space is also possible, but would
increase the computation time to a higher power of K.

Shor's algorithm (which we will review in detail in the next section) includes the evaluation
of the modular exponential function; that is, a unitary transformation U that acts on
elements of the computational basis as

$$ U : |a\rangle_i |0\rangle_o \longmapsto |a\rangle_i |x^a \ (\mathrm{mod}\ N)\rangle_o . \tag{1.1} $$

Here N is the K-bit number to be factored, a is an L-bit number (where usually $L \approx 2K$),
and x is a randomly selected positive integer less than N that is relatively prime to N;
$|\cdot\rangle_i$ and $|\cdot\rangle_o$ denote the states of the "input" and "output" registers of the machine,
respectively. Shor's algorithm aims to find the period of this function, the order of x mod N. From the
order of x, a factor of N can be extracted with reasonable likelihood, using standard results
of number theory.

To perform factorization, one first prepares the input register in a coherent superposition
of all possible L-bit computational basis states:

$$ \frac{1}{2^{L/2}} \sum_{a=0}^{2^L-1} |a\rangle_i . \tag{1.2} $$

Preparation of this state is relatively simple, involving just L one-qubit rotations (or, for the 
Cirac-Zoller device, just L laser pulses applied to the ions in the trap). Then the modular 
exponential function is evaluated by applying the transformation U above. Finally, a discrete 
Fourier transformation is applied to the input register, and the input register is subsequently 
measured. From the measured value, the order of x mod N can be inferred with reasonable 
likelihood. 

Shor's crucial insight was that the discrete Fourier transform can be evaluated in polynomial
time on a quantum computer. Indeed, its evaluation is remarkably efficient. With an
improvement suggested by Coppersmith [7] and Deutsch, evaluation of the L-bit Fourier
transform is accomplished by composing L one-qubit operations and $\frac{1}{2}L(L-1)$ two-qubit
operations. (For the Cirac-Zoller device, implementation of the discrete Fourier transform
requires $L(2L-1)$ distinct laser pulses.)

The bottleneck of Shor's algorithm is the rather more mundane task of evaluating the
modular exponential function, i.e., the implementation of the transformation U in Eq. (1.1).
This task demands far more computational resources than the rest of the algorithm, so we
will focus on evaluation of this function in this paper. There is a well-known (classical)
algorithm for evaluating the modular exponential that involves $O(K^3)$ elementary operations,
and we will make use of this algorithm here.
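
As a purely classical illustration of the well-known algorithm referred to above, the following sketch evaluates x^a (mod N) by repeated squaring. It is a minimal sketch under stated assumptions: the function name `mod_exp` and the loop layout are illustrative choices and are not the quantum networks constructed later in the paper.

```python
def mod_exp(x: int, a: int, n: int) -> int:
    """Return x**a mod n by repeated squaring (classical sketch, hypothetical helper)."""
    result = 1
    power = x % n                      # holds x^(2^i) mod n at step i
    while a > 0:
        if a & 1:                      # bit i of the exponent is set
            result = (result * power) % n
        power = (power * power) % n    # square for the next bit
        a >>= 1
    return result
```

The loop performs at most two modular multiplications per bit of a. With an L-bit exponent, K-bit operands, schoolbook multiplication costing of order K^2 bit operations, and L of order K, the total is of order K^3, which matches the operation count quoted in the text.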

The main problem that commands our attention is the management of the "scratchpad" 
space that is needed to perform the computation; that is, the extra qubits aside from the 
input and output registers that are used in intermediate steps of the computation. It is 
essential to erase the scratchpad before performing the discrete Fourier transform on the 
input register. Before the scratchpad is erased, the state of the machine will be of the form

$$ \frac{1}{2^{L/2}} \sum_a |a\rangle_i |x^a \ (\mathrm{mod}\ N)\rangle_o |g(a)\rangle_s , \tag{1.3} $$

where $|g(a)\rangle_s$ denotes the "garbage" stored in the scratchpad. If we were now to perform
the discrete Fourier transform on $|\cdot\rangle_i$, we would be probing the periodicity properties of the
function $x^a \ (\mathrm{mod}\ N) \oplus g(a)$, which may be quite different from the periodicity properties of
$x^a \ (\mathrm{mod}\ N)$ that we are interested in. Thus, the garbage in the scratchpad must be erased,
but the erasure is a somewhat delicate process. To avoid destroying the coherence of the 
computation, erasure must be performed as a reversible unitary operation. 

In principle, reversible erasure of the unwanted garbage presents no difficulty. Indeed, in
his pioneering paper on reversible computation, Bennett formulated a general strategy
for cleaning out the scratchpad: one can run the calculation to completion, producing the
state Eq. (1.3), copy the result from the output register to another ancillary register, and
then run the computation backwards to erase both the output register and the scratchpad.
However, while this strategy undoubtedly works, it may be far from optimal, for it may
require the scratchpad to be much larger than is actually necessary. We can economize on
scratchpad space by running subprocesses backwards at intermediate stages of the computation,
thus freeing some registers to be reused in a subsequent process. (Indeed, Bennett
himself [10] described a general procedure of this sort that greatly reduces the memory space
requirements.) However, for this reduction in required scratchpad space, we may pay a price
in increased computation time.
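
The compute, copy, uncompute strategy described above can be sketched with classical reversible operations. This is a toy illustration only: the register names, the helper `xor_step`, and the two-step function being evaluated are hypothetical and do not come from the paper. Each step XORs a function of other registers into a target register, so applying the same step twice undoes it.

```python
def xor_step(target, func, *sources):
    """Build a reversible step: reg[target] ^= func(reg[sources]). It is its own inverse."""
    def apply(reg):
        reg[target] ^= func(*(reg[s] for s in sources))
    return apply

def bennett_evaluate(a):
    """Toy compute / copy / uncompute run on 4-bit registers (hypothetical example)."""
    reg = {"in": a, "scratch": 0, "out": 0, "copy": 0}
    forward = [
        xor_step("scratch", lambda v: (v * v) % 16, "in"),    # leaves garbage in scratch
        xor_step("out", lambda s: (s + 3) % 16, "scratch"),   # result lands in out
    ]
    for step in forward:             # run the calculation to completion
        step(reg)
    reg["copy"] ^= reg["out"]        # copy the result to the ancillary register
    for step in reversed(forward):   # run backwards: clears out and scratch
        step(reg)
    return reg
```

After the run, only the input and the copied result remain; the scratch and output registers are back to zero. The cost is that every forward step is executed twice, which is the kind of time overhead the text refers to.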

One of our objectives in this paper is to explore this tradeoff between memory require- 
ments and computation time. This tradeoff is a central general issue in quantum com- 
putation (or classical reversible computation) that we have investigated by studying the 
implementation of the modular exponential function, the bottleneck of Shor's factorization 
algorithm. We have constructed a variety of detailed quantum networks that evaluate the 
modular exponential, and we have analyzed the complexity of our networks. A convenient 
(though somewhat arbitrary) measure of the complexity of a quantum algorithm is the num- 
ber of laser pulses that would be required to implement the algorithm on a device like that 
envisioned by Cirac and Zoller. We show that if N and x are K-bit classical numbers and 
a is an L-bit quantum number, then, on a machine with 2K + 1 qubits of scratch space,
the computation of $x^a \ (\mathrm{mod}\ N)$ can be achieved with $198L[K^2 + O(K)]$ laser pulses. If the
scratch space of the machine is increased by a single qubit, the number of pulses can be
reduced by about 6% (for K large), and if K qubits are added, the improvement in speed is
about 29%. We also exhibit a network that requires only K + 1 scratch qubits, but where
the required number of pulses is of order $LK^4$.

The smallest composite number to which Shor's algorithm may be meaningfully applied 
is N = 15. (The algorithm fails for N even and for $N = p^\alpha$, p prime.) Our general purpose
algorithm (which works for any value of N), in the case N = 15 (or K = 4, L = 8), would 
require 21 qubits and about 15,000 laser pulses. In fact, a much faster special purpose 
algorithm that exploits special properties of the number 15 can also be constructed — for 
what it's worth, the special purpose algorithm could "factor 15" with 6 qubits and only 38 
pulses. 

The fastest modern digital computers have difficulty factoring numbers larger than about 
130 digits (432 bits). According to our estimates, to apply Shor's algorithm to a number 
of this size on the ion trap computer (or a machine of similar design) would require about 
2160 ions and $3 \times 10^{10}$ laser pulses. The ion trap is an intrinsically slow device, for the
clock speed is limited by the frequency of the fundamental vibrational mode of the trapped
ions. Even under very favorable conditions, it seems unlikely that more than $10^4$ operations
could be implemented per second. For a computation of practical interest, the run time of 
the computation is likely to outstrip by far the decoherence time of the machine. It seems 
clear that a practical quantum computer will require a much faster clock speed than can 
be realized in the Cirac-Zoller design. For this reason, a design based on cavity quantum 
electrodynamics (in which processing involves excitation of photons rather than phonons)
[11], [12] may prove more promising in the long run.
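
The estimates quoted above follow from the leading-order formulas stated in the text (5K + 1 qubits and about 396 K^3 laser pulses). The sketch below simply evaluates them. It is a rough illustration: the function name `resources` is hypothetical, the O(K) corrections are ignored, and the run time assumes one laser pulse per operation at the optimistic rate of 10^4 per second mentioned in the text.

```python
def resources(k_bits: int, pulses_per_second: float = 1e4):
    """Leading-order resource estimate for factoring a k_bits-bit number (sketch)."""
    qubits = 5 * k_bits + 1              # memory qubits, 5K + 1
    pulses = 396 * k_bits ** 3           # laser pulses, leading term only
    seconds = pulses / pulses_per_second # assumes one pulse per operation
    return qubits, pulses, seconds

qubits, pulses, seconds = resources(432)  # a 130-digit (432-bit) number
days = seconds / 86400                    # run time in days under these assumptions
```

For K = 432 this gives 2161 qubits and roughly 3.2 x 10^10 pulses, in line with the figures in the text, and a run time of several weeks even at the optimistic pulse rate.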

Whatever the nature of the hardware, it seems likely that a practical quantum computer
will need to invoke some type of error correction protocol to combat the debilitating effects
of decoherence [13]. Recent progress in the theory of error-correcting quantum codes [14]
has bolstered the hope that real quantum computers will eventually be able to perform
interesting computational tasks.

Although we expect that the linear ion trap is not likely to ever become a practical 
computer, we wish to emphasize that it is a marvelous device for the experimental studies 
of the peculiar properties of entangled quantum states. Cirac and Zoller have already
pointed out that maximally entangled states of n ions [15] can be prepared very efficiently.
Since it is relatively easy to make measurements in the Bell operator basis for any pair of
entangled ions in the trap, it should be possible to, say, demonstrate the possibility of
quantum teleportation [17] (at least from one end of the trap to the other).

In Sec. II of this paper, we give a brief overview of the theory of quantum computation 
and describe Shor's algorithm for factoring. Cirac and Zoller's proposed implementation 
of a quantum computer using a linear ion trap is explained in Sec. III. Sec. IV gives a 
summary of the main ideas that guide the design of our modular exponentiation algorithms; 
the details of the algorithms are spelled out in Sec. V, and the complexity of the algorithms 
is quantified in Sec. VI. The special case N = 15 is discussed in Sec. VII. In Sec. VIII, we 
propose a simple experimental test of the quantum Fourier transform. Finally, in Appendix 
A, we describe a scheme for further improving the efficiency of our networks. 

Quantum networks that evaluate the modular exponential function have also been designed
and analyzed by Despain [18], by Shor [19], and by Vedral, Barenco, and Ekert [20].
Our main results are in qualitative agreement with the conclusions of these authors, but the
networks we describe are substantially more efficient.



II. QUANTUM COMPUTATION AND SHOR'S FACTORIZATION ALGORITHM 

A. Computation and physics 

The theory of computation would be bootless if the computations that it describes could 
not actually be carried out using physically realizable devices. Hence it is really the task of 
physics to characterize what is computable, and to classify the efficiency of computations. 
The physical world is quantum mechanical. Therefore, the foundations of the theory of 
computation must be quantum mechanical as well. The classical theory of computation 
(e.g., the theory of the universal Turing machine) should be viewed as an important special
case of a more general theory.

A "quantum computer" is a computing device that invokes intrinsically quantum-mechanical
phenomena, such as interference and entanglement.¹ In fact, a Turing machine
can simulate a quantum computer to any desired accuracy (and vice versa); hence, the classical
theory and the more fundamental quantum theory of computation agree on what is
computable. But they may disagree on the classification of complexity; what is easy to
compute on a quantum computer may be hard on a classical computer.

¹ For a lucid review of quantum computation and Shor's algorithm, see

B. Bits and qubits 

In classical theory, the fundamental unit of information is the bit: it can take either of
two values, say 0 and 1. All classical information can be encoded in bits, and any classical
computation can be reduced to fundamental operations that flip bits (changing 0 to 1 or 1
to 0) conditioned on the values of other bits.

In the quantum theory of information, the bit is replaced by a more general construct,
the qubit. We regard $|0\rangle$ and $|1\rangle$ as the orthonormal basis states for a two-dimensional
complex vector space. The state of a qubit (if "pure") can be any normalized vector,
denoted

$$ c_0 |0\rangle + c_1 |1\rangle , \tag{2.1} $$

where $c_0$ and $c_1$ are complex numbers satisfying $|c_0|^2 + |c_1|^2 = 1$. A classical bit can be
viewed as the special case in which the state of the qubit is always either $c_0 = 1$, $c_1 = 0$ or
$c_0 = 0$, $c_1 = 1$.

The possible pure states of a qubit can be parametrized by two real numbers. (The 
overall phase of the state is physically irrelevant.) Nevertheless, only one bit of classical 
information can be stored in a qubit and reliably recovered. If the value of the qubit in the 
state Eq. (2.1) is measured, the result is 0 with probability $|c_0|^2$ and 1 with probability $|c_1|^2$;
in the case $|c_0|^2 = |c_1|^2 = \frac{1}{2}$, the outcome of the measurement is a random number, and we
recover no information at all.
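
The measurement rule above can be illustrated with a small classical simulation. This is a sketch under stated assumptions: the function names are hypothetical, and the amplitudes are normalized inside the helper so that unnormalized inputs are also accepted.

```python
import random

def measurement_probabilities(c0: complex, c1: complex):
    """Return (P(0), P(1)) for the qubit state c0|0> + c1|1> (hypothetical helper)."""
    norm = abs(c0) ** 2 + abs(c1) ** 2       # normalize in case the input is not
    return abs(c0) ** 2 / norm, abs(c1) ** 2 / norm

def measure(c0: complex, c1: complex, rng=random) -> int:
    """Simulate one measurement of the qubit in the {|0>, |1>} basis."""
    p0, _ = measurement_probabilities(c0, c1)
    return 0 if rng.random() < p0 else 1
```

Each simulated measurement returns a single classical bit. The relative phase of the two amplitudes (for example 1 versus 1j for the second coefficient) does not change the probabilities, which is one way to see why the two real parameters of the state cannot both be read out.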

A string of n classical bits can take any one of $2^n$ possible values. For n qubits, these $2^n$
classical strings are regarded as the basis states for a complex vector space of dimension $2^n$,
and a pure state of n qubits is a normalized vector in this space.

C. Processing 

In a quantum computation, n qubits are initially prepared in an algorithmically simple 
input state, such as 

$$ |\mathrm{input}\rangle = |0\rangle |0\rangle |0\rangle \cdots |0\rangle . \tag{2.2} $$

Then a unitary transformation U is applied to the input state, yielding an output state 

$$ |\mathrm{output}\rangle = U |\mathrm{input}\rangle . \tag{2.3} $$

Finally, a set of commuting observables $O_1, O_2, O_3, \ldots$ is measured in the output state. The
measured values of these observables constitute the outcome of the computation. Since 
the output state is not necessarily an eigenstate of the measured observables, the quantum 
computation is not deterministic — rather, the same computation, performed many times, 
will generate a probability distribution of possible outcomes. 

(Note that the observables that are measured in the final step are assumed to be simple in 
some sense; otherwise the transformation U would be superfluous. Without loss of generality, 
we may specify that the values of all qubits (or a subset of the qubits) are measured at the 
end of the computation; that is, the jth qubit $|\cdot\rangle_j$ is projected onto the "computational
basis" $\{|0\rangle_j, |1\rangle_j\}$.)

To characterize the complexity of a computation, we must formulate some rules that 
specify how the transformation U is constructed. One way to do this is to demand that U is 
expressed as a product of elementary unitary transformations, or "quantum gates," that act 
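on a bounded number of qubits (independent of n). In fact, it is not hard to see [22] that
"almost any" two-qubit unitary transformation, together with qubit swapping operations,
is universal for quantum computation. That is, given a generic $4 \times 4$ unitary matrix $\tilde U$, let
$\tilde U^{(i,j)}$ denote $\tilde U$ acting on the ith and jth qubits according to

$$ \tilde U^{(i,j)} : |\epsilon_i\rangle_i |\epsilon_j\rangle_j \longmapsto \sum_{\epsilon_i', \epsilon_j'} \tilde U_{\epsilon_i' \epsilon_j', \epsilon_i \epsilon_j} |\epsilon_i'\rangle_i |\epsilon_j'\rangle_j . \tag{2.4} $$

Then any $2^n \times 2^n$ unitary transformation U can be approximated to arbitrary precision by
a finite string of $\tilde U^{(i,j)}$'s,

$$ U \simeq \tilde U^{(i_T,j_T)} \cdots \tilde U^{(i_2,j_2)} \tilde U^{(i_1,j_1)} . \tag{2.5} $$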



The length T of this string (the "time") is a measure of the complexity of the quantum 
computation. 

Determining the precise string of $\tilde U^{(i,j)}$'s that is needed to perform a particular computational
task may itself be computationally demanding. Therefore, to have a reasonable
notion of complexity, we should require that a conventional computer (a Turing machine) 
generates the instructions for constructing the unitary transformation U. The complexity of 
the computation is actually the sum of the complexity of the classical computation and the 
complexity of the quantum computation. Then we may say that a problem is tractable on a 
quantum computer if the computation that solves the problem can be performed in a time 
that is bounded from above by a polynomial in n, the number of qubits contained in the 
quantum register. This notion of tractability has the nice property that it is largely inde- 
pendent of the details of the design of the machine — that is, the choice of the fundamental 
quantum gates. The quantum gates of one device can be simulated to polynomial accuracy 
in polynomial time by the quantum gates of another device. 

It is also clear that a classical computer can simulate a quantum computer to any desired 
accuracy: all that is required to construct the state $|\mathrm{output}\rangle$ is repeated matrix
multiplication, and we can simulate the final measurement of the observables by expanding $|\mathrm{output}\rangle$
in a basis of eigenstates of the observables. However, the classical simulation may involve
matrices of exponentially large size (U is a $2^n \times 2^n$ matrix), and so may take an exponentially
long time. It was this simple observation that led Feynman to suggest that
quantum computers may be able to solve certain problems far more efficiently than classical
computers.

D. Massive Parallelism 

Deutsch put this suggestion in a more tangible form by emphasizing that a quantum
computer can exploit "massive quantum parallelism." Suppose we are interested in studying
the properties of a function f defined on the domain of nonnegative integers $0, 1, 2, \ldots, 2^L - 1$.
Imagine that a unitary transformation $U_f$ can be constructed that efficiently computes f:

$$ U_f : |(i_{L-1} i_{L-2} \ldots i_1 i_0)\rangle_{in} |(00 \ldots 00)\rangle_{out} \longmapsto |(i_{L-1} i_{L-2} \ldots i_1 i_0)\rangle_{in} |f(i_{L-1} i_{L-2} \ldots i_1 i_0)\rangle_{out} . \tag{2.6} $$

Here $(i_{L-1} i_{L-2} \ldots i_1 i_0)$ is an integer expressed in binary notation, and $|(i_{L-1} i_{L-2} \ldots i_1 i_0)\rangle$
denotes the corresponding basis state of L qubits. Since the function f might not be invertible,
$U_f$ has been constructed to leave the state in the $|\cdot\rangle_{in}$ register undisturbed, to ensure
that it is indeed a reversible operation.

Eq. (2.6) defines the action of $U_f$ on each of $2^L$ basis states, and hence, by linear superposition,
on all states of a $2^L$-dimensional Hilbert space. In particular, starting with the state
$|(00 \ldots 00)\rangle_{in}$, and applying single-qubit unitary transformations to each of the L qubits, we
can easily prepare the state

$$ \left( \frac{1}{\sqrt{2}} |0\rangle + \frac{1}{\sqrt{2}} |1\rangle \right)^L = \frac{1}{2^{L/2}} \sum_{i_{L-1}=0}^{1} \cdots \sum_{i_0=0}^{1} |(i_{L-1} i_{L-2} \ldots i_1 i_0)\rangle_{in} \equiv \frac{1}{2^{L/2}} \sum_{x=0}^{2^L-1} |x\rangle_{in} , \tag{2.7} $$

an equally weighted coherent superposition of all of the $2^L$ distinct basis states. With this
input, the action of $U_f$ prepares the state

$$ |\psi_f\rangle \equiv \frac{1}{2^{L/2}} \sum_{x=0}^{2^L-1} |x\rangle_{in} |f(x)\rangle_{out} . \tag{2.8} $$

The highly entangled quantum state Eq. (2.8) exhibits what Deutsch called "massive parallelism."
Although we have run the computation (applied the unitary transformation $U_f$)
only once, in a sense this state encodes the value of the function f for each possible value of
the input variable x. Were we to measure the value of all the qubits of the input register,
obtaining the result $|x = a\rangle_{in}$, subsequent measurement of the output register would reveal
the value of f(a). Unfortunately, the measurement will destroy the entangled state, so the
procedure cannot be repeated. We succeed, then, in unambiguously evaluating f for only a
single value of its argument.

E. Periodicity



Deutsch emphasized, however, that certain global properties of the function f can
be extracted from the state Eq. (2.8) by making appropriate measurements. Suppose, for
example, that f is a periodic function (defined on the nonnegative integers), whose period
r is much less than $2^L$ (where r does not necessarily divide $2^L$), and that we are interested
in finding the period. In general, determining r is a computationally difficult task (for a
classical computer) if r is large. Shor's central observation is that a quantum computer, by
exploiting quantum interference, can determine the period of a function efficiently.

Given the state Eq. (2.8), this computation of the period can be performed by manipulating
(and ultimately measuring) only the state of the input register; the output register
need not be disturbed. For the purpose of describing the outcome of such measurements,
we may trace over the unobserved state of the output register, obtaining the mixed density
matrix

$$ \rho_{in,f} \equiv \mathrm{tr}_{out} \left( |\psi_f\rangle \langle \psi_f| \right) = \frac{1}{r} \sum_{k=0}^{r-1} |\psi_k\rangle \langle \psi_k| , \tag{2.9} $$

where

$$ |\psi_k\rangle = \frac{1}{\sqrt{\mathcal{N}}} \sum_{j=0}^{\mathcal{N}-1} |x = k + rj\rangle_{in} \tag{2.10} $$

is the coherent superposition of all the input states that are mapped to a given output.
(Here $\mathcal{N} - 1$ is the greatest integer less than $(2^L - k)/r$.)

Now, Shor showed that the unitary transformation

$$ FT : |x\rangle \longmapsto \frac{1}{2^{L/2}} \sum_{y=0}^{2^L-1} e^{2\pi i xy/2^L} |y\rangle \tag{2.11} $$

(the Fourier transform) can be composed from a number of elementary quantum gates that
is bounded from above by a polynomial in L. The Fourier transform can be used to probe
the periodicity properties of the state Eq. (2.9). If we apply FT to the input register and
then measure its value y, the outcome of the measurement is governed by the probability
distribution

$$ \mathrm{Prob}(y) = \frac{\mathcal{N}}{2^L} \left| \frac{1}{\mathcal{N}} \sum_{j=0}^{\mathcal{N}-1} e^{2\pi i y r j/2^L} \right|^2 . \tag{2.12} $$

This probability distribution is strongly peaked about values of y of the form

$$ \frac{y}{2^L} = \frac{\mathrm{integer}}{r} \pm O(2^{-L}) , \tag{2.13} $$

where the integer is a random number less than r. (For other values of y, the phases in the
sum over j interfere destructively.) Now suppose that the period r is known to be less than
$2^{L/2}$. The minimal spacing between two distinct rational numbers, both with denominator
less than $2^{L/2}$, is $O(2^{-L})$. Therefore, if we measure y, the rational number with denominator
less than $2^{L/2}$ that is closest to $y/2^L$ is reasonably likely to be a rational number with
denominator r, where the numerator is a random number less than r. Finally, it is known
that if positive integers r and s < r are randomly selected, then r and s will be relatively
prime with a probability of order 1/log log r. Hence, even after the rational number is
reduced to lowest terms, it is not unlikely that the denominator will be r.
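
The peaking of the distribution in Eq. (2.12) can be checked numerically for a small register. The sketch below evaluates the sum over j directly; the function name `peak_distribution` and the sample values L = 8, r = 10 are illustrative assumptions, and the number of terms in the sum is taken to be the number of integers x = k + rj below 2^L, as in Eq. (2.10).

```python
import cmath

def peak_distribution(L: int, r: int, k: int = 0):
    """Return [Prob(y) for y in range(2**L)] per Eq. (2.12), for offset k (sketch)."""
    size = 2 ** L
    count = len(range(k, size, r))        # number of terms x = k + r*j below 2^L
    probs = []
    for y in range(size):
        amp = sum(cmath.exp(2j * cmath.pi * y * r * j / size) for j in range(count))
        probs.append(abs(amp) ** 2 / (count * size))
    return probs

probs = peak_distribution(8, 10)
# Values of y nearest to integer multiples of 2^L / r, where the peaks are expected.
peaks = [round(m * 2 ** 8 / 10) for m in range(10)]
weight_on_peaks = sum(probs[y] for y in peaks)
```

For this example the ten values of y nearest to multiples of 2^L/r carry most of the total probability, even though r does not divide 2^L.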

We conclude then (if r is known to be less than $2^{L/2}$), that each time we prepare the state
Eq. (2.8), apply the FT to the input register, and then measure the input register, we have
a probability of order 1/log log r > 1/log L of successfully inferring from the measurement
the period r of the function f. Hence, if we carry out this procedure a number of times that
is large compared to log L, we will find the period of f with probability close to unity.
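
The classical step of inferring a candidate period from a measured y can be sketched with Python's standard `fractions` module, whose `Fraction.limit_denominator` method returns the closest fraction with a bounded denominator. The function name `candidate_period` is hypothetical, and this is only one way to carry out the rational approximation described above.

```python
import math
from fractions import Fraction

def candidate_period(y: int, L: int) -> int:
    """Denominator of the fraction closest to y / 2^L with denominator below 2^(L/2)."""
    max_den = math.isqrt(2 ** L - 1)      # largest integer strictly below 2^(L/2)
    return Fraction(y, 2 ** L).limit_denominator(max_den).denominator
```

With L = 8 and true period r = 10, a measured value y = 77 lies near 3 x 256/10 and yields the candidate 10. A measured value y = 51 lies near 2 x 256/10, and the fraction reduces to 1/5, so the candidate is 5. This is the case discussed in the text, where the numerator shares a factor with r and the procedure must be repeated.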

All that remains to be explained is how the construction of the unitary transformation
FT is actually carried out. A simpler construction than the one originally presented by Shor
[1] was later suggested by Coppersmith [7] and Deutsch. (It is, in fact, the standard fast
Fourier transform, adapted for a quantum computer.) In their construction, two types of
elementary quantum gates are used. The first type is a single-qubit rotation

$$ U^{(j)} : |0\rangle_j \longmapsto \frac{1}{\sqrt{2}} \left( |0\rangle_j + |1\rangle_j \right) , \quad |1\rangle_j \longmapsto \frac{1}{\sqrt{2}} \left( |0\rangle_j - |1\rangle_j \right) , \tag{2.14} $$

the same transformation that was used to construct the state Eq. (2.7). The second type is
a two-qubit conditional phase operation

$$ V^{(j,k)}(\theta) : |\epsilon\rangle_j |\eta\rangle_k \longmapsto e^{i \theta \epsilon \eta} |\epsilon\rangle_j |\eta\rangle_k . \tag{2.15} $$

That is, $V^{(j,k)}(\theta)$ multiplies the state by the phase $e^{i\theta}$ if both the jth and kth qubits have
the value 1, and acts trivially otherwise.

It is not difficult to verify that the transformation

$$ \widehat{FT} \equiv \left\{ U^{(0)} V^{(0,1)}(\pi/2) V^{(0,2)}(\pi/4) \cdots V^{(0,L-1)}(\pi/2^{L-1}) \right\} \cdots \left\{ U^{(L-3)} V^{(L-3,L-2)}(\pi/2) V^{(L-3,L-1)}(\pi/4) \right\} \cdot \left\{ U^{(L-2)} V^{(L-2,L-1)}(\pi/2) \right\} \cdot \left\{ U^{(L-1)} \right\} \tag{2.16} $$

acts as specified in Eq. (2.11), except that the order of the qubits in y is reversed.² (Here
the transformation furthest to the right acts first.) We may act on the input register with
$\widehat{FT}$ rather than FT, and then reverse the bits of y after the measurement. Thus, the
implementation of the Fourier transform is achieved by composing altogether L one-qubit
gates and L(L - 1)/2 two-qubit gates.
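
The claim that the gate sequence of Eq. (2.16) reproduces the Fourier transform of Eq. (2.11) up to bit reversal can be checked numerically for small L. The sketch below is a classical state-vector simulation under stated assumptions: qubit j is taken to be bit j of the integer x, the single-qubit rotation is assumed to map |0> to (|0> + |1>)/sqrt(2) and |1> to (|0> - |1>)/sqrt(2), and all function names are hypothetical.

```python
import cmath
import math

def apply_u(state, j):
    """Assumed single-qubit rotation U^(j) acting on qubit j (bit j of the index)."""
    new = [0j] * len(state)
    s = 1 / math.sqrt(2)
    for x, amp in enumerate(state):
        bit = (x >> j) & 1
        new[x & ~(1 << j)] += s * amp                       # component with qubit j = 0
        new[x | (1 << j)] += s * amp * (-1 if bit else 1)   # component with qubit j = 1
    return new

def apply_v(state, j, k, theta):
    """Conditional phase V^(j,k)(theta) of Eq. (2.15)."""
    phase = cmath.exp(1j * theta)
    return [amp * phase if (x >> j) & 1 and (x >> k) & 1 else amp
            for x, amp in enumerate(state)]

def ft_hat(state, L):
    """Apply the gates of Eq. (2.16); the rightmost factor acts first."""
    for j in range(L - 1, -1, -1):
        for k in range(j + 1, L):
            state = apply_v(state, j, k, math.pi / 2 ** (k - j))
        state = apply_u(state, j)
    return state

def reverse_bits(y, L):
    """Reverse the L-bit binary representation of y."""
    return int(format(y, f"0{L}b")[::-1], 2)

def check(L):
    """Compare ft_hat on every basis state with Eq. (2.11), with y bit-reversed."""
    size = 2 ** L
    for x in range(size):
        basis = [0j] * size
        basis[x] = 1
        out = ft_hat(basis, L)
        for y in range(size):
            expected = cmath.exp(2j * cmath.pi * x * y / size) / math.sqrt(size)
            if abs(out[reverse_bits(y, L)] - expected) > 1e-9:
                return False
    return True
```

Calling `check(3)` or `check(4)` compares every matrix element. The loop in `ft_hat` applies L single-qubit gates and L(L - 1)/2 conditional phases, matching the gate count stated in the text.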

Of course, in an actual device, the phases of the $V^{(j,k)}(\theta)$ gates will not be rendered with
perfect accuracy. Fortunately, the peaking of the probability distribution in Eq. (2.12) is
quite robust. As long as the errors in the phases occurring in the sum over j are small compared
to $2\pi$, constructive interference will occur when the condition Eq. (2.13) is satisfied.
In particular, the gates in Eq. (2.16) with small values of $\theta = \pi/2^{|j-k|}$ can be omitted, without
much affecting the probability of finding the correct period of the function f. Thus (as
Coppersmith [7] observed), the time required to execute the FT operation to fixed accuracy
increases only linearly with L.

² For a lucid explanation, see [19].
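
The effect of omitting the small-angle gates on the gate count can be made concrete. The sketch below counts the conditional phase gates of Eq. (2.16) that survive when only pairs of qubits with separation k - j up to a chosen cutoff are kept. The function name and the `max_separation` parameter are hypothetical; the text does not specify a particular cutoff.

```python
def two_qubit_gate_count(L, max_separation=None):
    """Count V^(j,k) gates kept in Eq. (2.16) when k - j <= max_separation (sketch)."""
    count = 0
    for j in range(L):
        for k in range(j + 1, L):
            # Gates with larger separation have phases below pi / 2**max_separation.
            if max_separation is None or k - j <= max_separation:
                count += 1
    return count
```

Without a cutoff the count is L(L - 1)/2, which grows quadratically. With a fixed cutoff m the count is mL - m(m + 1)/2, which grows only linearly in L, consistent with Coppersmith's observation quoted above.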



F. Factoring 

The above observations show that a quantum computer can find the prime factors of 
a number efficiently, for it is well known that factoring can be reduced to the problem of 
finding the period of a function. Suppose we wish to find a nontrivial prime factor of the 
positive integer N. We choose a random number x < N. We can efficiently check, using 
Euclid's algorithm, whether x and N have a common factor. If so, we have found a factor 
of N, as desired. If not, let us compute the period of the modular exponential function 

$$ f_{N,x}(a) = x^a \ (\mathrm{mod}\ N) . \tag{2.17} $$

The period is the smallest positive r such that $x^r \equiv 1 \ (\mathrm{mod}\ N)$, called the order of x mod
N. It exists whenever N and x < N have no common factor.

Now suppose that r is even, and that $x^{r/2} \not\equiv -1 \ (\mathrm{mod}\ N)$. Then, since N divides
the product $(x^{r/2} + 1)(x^{r/2} - 1) = x^r - 1$, but does not divide either one of the factors
$(x^{r/2} \pm 1)$ (it cannot divide $x^{r/2} - 1$, because r is the smallest exponent with $x^r \equiv 1$),
N must have a common factor with each of $(x^{r/2} \pm 1)$. This common factor, a
nontrivial factor of N, can then be efficiently computed.
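
The classical post-processing described above amounts to one modular exponentiation and one greatest common divisor. The sketch below illustrates it; the function names are hypothetical, and the brute-force `order` function is only a classical stand-in for the quantum period-finding step, usable for very small N.

```python
from math import gcd

def order(x: int, n: int) -> int:
    """Smallest r > 0 with x**r = 1 (mod n); assumes gcd(x, n) == 1 (brute force)."""
    r, value = 1, x % n
    while value != 1:
        value = (value * x) % n
        r += 1
    return r

def factor_from_order(x: int, r: int, n: int):
    """Return a nontrivial factor of n, or None if r is odd or x**(r/2) = -1 (mod n)."""
    if r % 2 != 0:
        return None
    half = pow(x, r // 2, n)          # x^(r/2) mod n
    if half == n - 1:                 # x^(r/2) = -1 (mod n): this x fails
        return None
    return gcd(half - 1, n)
```

For N = 15 and x = 7, the order is 4, x^2 = 4 (mod 15), and gcd(3, 15) = 3 is a nontrivial factor. For x = 14 the order is 2 and x = -1 (mod 15), so the attempt fails and another x must be chosen.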

It only remains to consider how likely it is, given a random x relatively prime to N, that
the conditions r even and $x^{r/2} \not\equiv -1 \ (\mathrm{mod}\ N)$ are satisfied. In fact, it can be shown [19], [21]
that, for N odd, the probability that these conditions are met is at least 1/2, except in the
case where N is a prime power ($N = p^\alpha$, p prime). (The trouble with $N = p^\alpha$ is that in this
case ±1 are the only "square roots" of 1 in multiplication mod N, so that, even if r is even,
$x^{r/2} \equiv -1 \ (\mathrm{mod}\ N)$ will always be satisfied.) Anyway, if N is of this exceptional type (or if
N is even), it can be efficiently factored by conventional (classical) methods.

Thus, Shor formulated a probabilistic algorithm for factoring N that will succeed with
probability close to 1 in a time that is bounded from above by a polynomial in log N. To
factor N we choose L so that, say, $N^2 < 2^L < 2N^2$. Then, since we know that $r < N < 2^{L/2}$,
we can use the method described above to efficiently compute the period r of the function
$f_{N,x}$. We generate the entangled state Eq. (2.8), apply the Fourier transform, and measure
the input register, thus generating a candidate value of r. Then, a classical computer is
used to find $\gcd(x^{r/2} - 1, N)$. If there is a nontrivial common divisor, we have succeeded in
finding a factor of N. If not, we repeat the procedure until we succeed.

Of course, it is implicit in the above description that the evaluation of the function $f_{N,x}$
can be performed efficiently on the quantum computer. The computational complexity of
$f_{N,x}$ is, in fact, the main topic of this paper.

G. Outlook



It is widely believed that no classical algorithm can factor a large number in polynomi- 
ally bounded time (though this has never been rigorously demonstrated). The existence of 
Shor's algorithm, then, indicates that the classification of complexity for quantum compu- 
tation differs from the corresponding classical classification. Aside from being an interesting 
example of an intrinsically hard problem, factoring is also of some practical interest: the
security of the widely used RSA public key cryptography scheme [23] relies on the presumed
difficulty of factoring large numbers. 

It is not yet known whether a quantum computer can efficiently solve "NP-complete" 
problems, which are believed to be intrinsically more difficult than the factoring problem. 
(The "traveling salesman problem" is a notorious example of an NP-complete problem.) 
It would be of great fundamental interest (and perhaps of practical interest) to settle this 
question. Conceivably, a positive answer could be found by explicitly exhibiting a suitable 
algorithm. In any event, better characterizing the class of problems that can be solved in 
"quantum polynomial time" is an important unsolved problem. 

The quantum factoring algorithm works by coherently summing an exponentially large 
number of amplitudes that interfere constructively, building up the strong peaks in the 
probability distribution Eq. (2.12). Unfortunately, this "exponential coherence" is extremely
vulnerable to the effects of noise. When the computer interacts with its environment, the
quantum state of the computer becomes entangled with the state of the environment; hence
the pure quantum state of the computer decays to an incoherent mixed state, a phenomenon
known as decoherence. Just as an illustration, imagine that, after the coherent superposition
state Eq. (2.10) is prepared, each qubit has a probability $p \ll 1$ of decohering completely
before the FT is applied and the device is measured; in other words, $pL$ of the $L$ qubits
decohere, and the state of the computer becomes entangled with $2^{pL}$ mutually orthogonal
states of the environment. Thus, the number of terms in the coherent sum in Eq. (2.12) is
reduced by the factor $2^{-pL}$, and the peaks in the probability distribution are weakened by
the factor $2^{-2pL}$. For any nonzero $p$, then, the probability of successfully finding a factor
decreases exponentially as $L$ grows large.
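
To get a feeling for the numbers, the weakening factor quoted above can be tabulated directly. This is a minimal sketch that simply evaluates the stated factor $2^{-2pL}$; the function name and the sample values of $p$ and $L$ are illustrative assumptions.

```python
def peak_suppression(p: float, L: int) -> float:
    """Factor by which the probability peaks are weakened when pL of the L qubits decohere."""
    return 2.0 ** (-2.0 * p * L)

if __name__ == "__main__":
    for L in (100, 1000, 10000):
        # Same per-qubit decoherence probability, growing register size.
        print(L, peak_suppression(0.01, L))
```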

Interaction with the environment, and hence decoherence, always occurs at some level.
It seems, then, that the potential of a quantum computer to solve hard problems efficiently
can be realized only if suitable schemes are found that control the debilitating effects of
decoherence. In some remarkable recent developments [14], clever error correction schemes
have been proposed for encoding and storing quantum information that sharply reduce its
susceptibility to noise. Some remaining challenges are: to incorporate error correction into
the operation of a quantum network (so that it can operate with high reliability in spite
of the effects of decoherence), and to find efficient error-correction schemes that can be
implemented in realistic working devices.



III. THE LINEAR ION TRAP 
A. A realizable device



The hardware for a quantum computer must meet a variety of demanding criteria. A 
suitable method for storing qubits should be chosen such that: (1) the state of an individual 
qubit can be controlled and manipulated, (2) carefully controlled strong interactions between 
distinct qubits can be induced (so that nonlinear logic gates can be constructed), and (3) 
the state of a qubit can be read out efficiently. Furthermore, to ensure effective operation:
(1) the storage time for the qubits must be long enough so that many logical operations can 
be performed, (2) the machine should be free of imperfections that could introduce errors 
in the logic gates, and (3) the machine should be well isolated from its environment, so that 
the characteristic decoherence time is sufficiently long. 

Cirac and Zoller proposed an incarnation of a quantum computer that meets these
criteria remarkably well and that may be within the grasp of existing technology. In their
proposal, ions are collected in a linear harmonic trap. The internal state of each ion encodes
one qubit: the ground state $|g\rangle$ is interpreted as $|0\rangle$, and a long-lived metastable excited state
$|e\rangle$ is interpreted as $|1\rangle$. The quantum state of the computer in this basis can be efficiently
read out by the "quantum jump method" [24]. A laser is tuned to a transition from the
state $|g\rangle$ to a short-lived excited state that decays back to $|g\rangle$; when the laser illuminates
the ions, each qubit with value $|0\rangle$ fluoresces strongly, while the qubits with value $|1\rangle$ remain
dark.

Coulomb repulsion keeps the ions sufficiently well separated that they can be individually
addressed by pulsed lasers. If a laser is tuned to the frequency $\omega$, where $\hbar\omega$ is the energy
splitting between $|g\rangle$ and $|e\rangle$, and is focused on the $i$th ion, then Rabi oscillations are
induced between $|0\rangle_i$ and $|1\rangle_i$. By timing the laser pulse properly, and choosing the phase of
the laser appropriately, we can prepare the $i$th ion in an arbitrary superposition of $|0\rangle_i$ and
$|1\rangle_i$. (Of course, since the states $|g\rangle$ and $|e\rangle$ are nondegenerate, the relative phase in this
linear combination rotates with time as $e^{-i\omega t}$ even when the laser is turned off. It is most
convenient to express the quantum state of the qubits in the interaction picture, so that this
time-dependent phase is rotated away.)

Crucial to the functioning of the quantum computer are the quantum gates that induce 
entanglement between distinct qubits. The qubits must interact if nontrivial quantum gates 
are to be constructed. In the ion trap computer, the interactions are effected by the Coulomb 
repulsion between the ions. Because of the mutual Coulomb repulsion, there is a spectrum 
of coupled normal modes for the ion motion. When an ion absorbs or emits a laser photon, 
the center of mass of the ion recoils. But if the laser is properly tuned, then when a single 
ion absorbs or emits, a normal mode involving many ions will recoil coherently (as in the 
Mössbauer effect).

The vibrational mode of lowest frequency (frequency $\nu$) is the center-of-mass (CM) mode,
in which the ions oscillate in lockstep in the harmonic well of the trap. The ions can be
laser cooled to a temperature much less than $\nu$, so that each vibrational normal mode is
very likely to occupy its quantum-mechanical ground state. Now imagine that a laser tuned
to the frequency $\omega - \nu$ shines on the $i$th ion. For a properly timed pulse (a $\pi$ pulse, or a $k\pi$
pulse for $k$ odd), the state $|e\rangle_i$ will rotate to $|g\rangle_i$, while the CM oscillator makes a transition
from its ground state $|0\rangle_{\rm CM}$ to its first excited state $|1\rangle_{\rm CM}$ (a CM "phonon" is produced).
However, the state $|g\rangle_i|0\rangle_{\rm CM}$ is not on resonance for any transition, and so is unaffected by
the pulse. Thus, with a single laser pulse, we may induce the unitary transformation

$$W_{\rm phon}^{(i)} : \left\{ \begin{array}{lcl} |g\rangle_i|0\rangle_{\rm CM} & \longmapsto & |g\rangle_i|0\rangle_{\rm CM} \\ |e\rangle_i|0\rangle_{\rm CM} & \longmapsto & -i\,|g\rangle_i|1\rangle_{\rm CM} \end{array} \right. \qquad (3.1)$$

This operation removes a bit of information that is initially stored in the internal state of
the $i$th ion, and deposits the bit in the CM phonon mode. Applying $W_{\rm phon}^{(i)}$ again would
reverse the operation (up to a phase), removing the phonon and reinstating the bit stored
in ion $i$. However, all of the ions couple to the CM phonon, so that once the information
has been transferred to the CM mode, this information will influence the response of ion $j$
if a laser pulse is subsequently directed at that ion. By this scheme, nontrivial logic gates
can be constructed, as we will describe in more detail below.

An experimental demonstration of an operation similar to $W_{\rm phon}^{(i)}$ was recently carried out
by Monroe et al. In this experiment, a single $^9$Be$^+$ ion occupied the trap. In earlier
work, a linear trap was constructed that held 33 ions, but these were not cooled down to the
vibrational ground state. The effort to increase the number of qubits in a working device is
ongoing.

Perhaps the biggest drawback of the ion trap is that it is an intrinsically slow device. Its
speed is ultimately limited by the energy-time uncertainty relation; since the uncertainty in
the energy of the laser photons should be small compared to the characteristic vibrational
splitting $\nu$, the pulse must last a time large compared to $\nu^{-1}$. In the Monroe et al.
experiment, $\nu$ was as large as 50 MHz, but it is likely to be orders of magnitude smaller in a device
that contains many ions.

In an alternate version of the above scheme (proposed by Pellizzari et al. [12]), many
atoms are stored in an optical cavity, and the atoms interact via the cavity photon mode
(rather than the CM vibrational mode). In principle, quantum gates in a scheme based on
cavity QED could be intrinsically much faster than gates implemented in an ion trap. An
experimental demonstration of a rudimentary quantum gate involving photons interacting
with an atom in a cavity was recently reported by Turchette et al.



B. Conditional phase gate 



An interesting two-qubit gate can be constructed by applying three laser pulses. After
a phonon has been (conditionally) excited, we can apply a laser pulse to the $j$th ion that is
tuned to the transition $|g\rangle_j|1\rangle_{\rm CM} \to |e'\rangle_j|0\rangle_{\rm CM}$, where $|e'\rangle$ is another excited state (different
from $|e\rangle$) of the ion. The effect of a $2\pi$ pulse is to induce the transformation

$$V^{(j)} : \left\{ \begin{array}{lcl} |g\rangle_j|0\rangle_{\rm CM} & \longmapsto & |g\rangle_j|0\rangle_{\rm CM} \\ |e\rangle_j|0\rangle_{\rm CM} & \longmapsto & |e\rangle_j|0\rangle_{\rm CM} \\ |g\rangle_j|1\rangle_{\rm CM} & \longmapsto & -|g\rangle_j|1\rangle_{\rm CM} \\ |e\rangle_j|1\rangle_{\rm CM} & \longmapsto & |e\rangle_j|1\rangle_{\rm CM} \end{array} \right. \qquad (3.2)$$

Only the phase of the state $|g\rangle_j|1\rangle_{\rm CM}$ is affected by the $2\pi$ pulse, because this is the only
state that is on resonance for a transition when the laser is switched on. (It would not have
had the same effect if we had tuned the laser to the transition from $|g\rangle|1\rangle_{\rm CM}$ to $|e\rangle|0\rangle_{\rm CM}$,
because then the state $|e\rangle|0\rangle_{\rm CM}$ would also have been modified by the pulse.) Applying
$W_{\rm phon}^{(i)}$ again removes the phonon, and we find that

$$V^{(i,j)} \equiv W_{\rm phon}^{(i)} \cdot V^{(j)} \cdot W_{\rm phon}^{(i)} : |\epsilon\rangle_i|\eta\rangle_j \longmapsto (-1)^{\epsilon\eta}\,|\epsilon\rangle_i|\eta\rangle_j \qquad (3.3)$$

is a conditional phase gate; it multiplies the quantum state by $(-1)$ if the qubits $|\cdot\rangle_i$ and $|\cdot\rangle_j$
both have the value 1, and acts trivially otherwise. A remarkable and convenient feature of
this construction is that the two qubits that interact need not be in neighboring positions in
the linear trap. In principle, the ions on which the gate acts could be arbitrarily far apart.

This gate can be generalized so that the conditional phase $(-1)$ is replaced by an arbitrary
phase $e^{i\theta}$: we replace the $2\pi$ pulse directed at ion $j$ by two $\pi$ pulses with differing values of
the laser phase, and modify the laser phase for one of the $\pi$ pulses directed at ion $i$. Thus, with
4 pulses, we construct the conditional phase transformation $V^{(i,j)}(\theta)$ defined in Eq. (2.15)
that is needed to implement the Fourier transform $FT$. The $L$-qubit Fourier transform,
then, requiring $L(L-1)/2$ conditional phase gates and $L$ single-qubit rotations, can be
implemented with altogether $L(2L-1)$ laser pulses.
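
The pulse count follows from simple bookkeeping: 4 pulses for each conditional phase gate and 1 pulse for each single-qubit rotation. A small sketch (the function name is an illustrative assumption) makes the arithmetic explicit:

```python
def ft_pulse_count(L: int) -> int:
    """Laser pulses for the L-qubit Fourier transform in the ion trap."""
    phase_gates = L * (L - 1) // 2   # conditional phase gates, 4 pulses each
    rotations = L                    # single-qubit rotations, 1 pulse each
    return 4 * phase_gates + rotations

if __name__ == "__main__":
    for L in (1, 2, 8, 100):
        # The total agrees with the closed form L(2L - 1).
        assert ft_pulse_count(L) == L * (2 * L - 1)
        print(L, ft_pulse_count(L))
```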

Actually, we confront one annoying little problem when we attempt to implement the
Fourier transform. The single-qubit rotations that can be simply induced by shining the
laser on an ion are unitary transformations with determinant one (the exponential of an
off-diagonal Hamiltonian), while the rotation $U^{(j)}$ defined in Eq. (2.14) actually has determinant
$(-1)$. We can replace $U^{(j)}$ in the construction of the $FT$ operator (Eq. (2.16)) by the
transformation

$$\tilde U^{(j)} \equiv \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \qquad (3.4)$$

(which can be induced by a single laser pulse with properly chosen laser phase). However,
the transformation $\widetilde{FT}$ thus constructed differs from $FT$ according to

$$\langle y|\widetilde{FT}|x\rangle = (-1)^{{\rm Par}(y)}\, \langle y|FT|x\rangle \qquad (3.5)$$

where ${\rm Par}(y)$ is the parity of $y$, the number of 1's appearing in its binary expansion. Fortunately,
the additional phase $(-1)^{{\rm Par}(y)}$ has no effect on the probability distribution Eq. (2.12),
so this construction is adequate for the purpose of carrying out the factorization algorithm.



C. Controlled$^k$-NOT gate



The conditional $(-1)$ phase gate Eq. (3.3) differs from a controlled-NOT gate by a mere
change of basis. The controlled-NOT operation $C_{[[i]],j}$ acts as

$$C_{[[i]],j} : |\epsilon\rangle_i|\eta\rangle_j \longmapsto |\epsilon\rangle_i|\eta \oplus \epsilon\rangle_j , \qquad (3.6)$$

where $\oplus$ denotes the logical XOR operation (binary addition mod 2). Thus $C_{[[i]],j}$ flips the
value of the target qubit $|\cdot\rangle_j$ if the control qubit $|\cdot\rangle_i$ has the value 1, and acts trivially
otherwise. We see that the controlled-NOT can be constructed as

$$C_{[[i]],j} \equiv \left[\tilde U^{(j)}\right]^{-1} \cdot W_{\rm phon}^{(i)} \cdot V^{(j)} \cdot W_{\rm phon}^{(i)} \cdot \tilde U^{(j)} , \qquad (3.7)$$

where $\tilde U^{(j)}$ is the single-qubit rotation defined in Eq. (3.4). Since $\tilde U^{(j)}$ (or its inverse) can
be realized by directing a $\pi/2$ pulse at ion $j$, we see that the controlled-NOT operation can
be implemented in the ion trap with altogether 5 laser pulses.
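
The change of basis can be checked numerically. The sketch below assumes a particular determinant-one rotation, $\frac{1}{\sqrt 2}\begin{pmatrix}1&1\\-1&1\end{pmatrix}$ acting on column vectors in the basis $(|0\rangle, |1\rangle)$, and treats the three-pulse sequence $W\cdot V\cdot W$ as the ideal conditional phase gate of Eq. (3.3); the helper names are hypothetical and plain Python lists are used in place of a matrix library.

```python
from math import sqrt, isclose

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def kron(A, B):
    """Kronecker product: A acts on the first qubit, B on the second."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

S = 1.0 / sqrt(2.0)
U_TILDE = [[S, S], [-S, S]]          # assumed determinant-one rotation on the target ion
IDENTITY = [[1, 0], [0, 1]]

# Two-qubit basis index is 2*eps + eta (control i first, target j second).
COND_PHASE = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]

U_ON_TARGET = kron(IDENTITY, U_TILDE)
U_INVERSE = transpose(U_ON_TARGET)   # real orthogonal matrix, so inverse = transpose

# Rightmost factor acts first: rotate, apply conditional phase, rotate back.
CNOT = matmul(U_INVERSE, matmul(COND_PHASE, U_ON_TARGET))

if __name__ == "__main__":
    for row in CNOT:
        print([round(entry, 12) for entry in row])
```

Under these assumptions the product is the permutation matrix that exchanges the basis states with $\epsilon = 1$, $\eta = 0$ and $\epsilon = 1$, $\eta = 1$, which is the controlled-NOT.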

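The controlled-NOT gate can be generalized to an operation that has a string of $k$ control
qubits; we will refer to this operation as the controlled$^k$-NOT operation. (For $k = 2$, it is
often called the Toffoli gate.) Its action is

$$C_{[[i_1,\dots,i_k]],j} : |\epsilon_1\rangle_{i_1} \cdots |\epsilon_k\rangle_{i_k} |\epsilon\rangle_j \longmapsto |\epsilon_1\rangle_{i_1} \cdots |\epsilon_k\rangle_{i_k} |\epsilon \oplus (\epsilon_1 \wedge \cdots \wedge \epsilon_k)\rangle_j , \qquad (3.8)$$

where $\wedge$ denotes the logical AND operation (binary multiplication). If all $k$ of the control
qubits labeled $i_1, \dots, i_k$ take the value 1, then $C_{[[i_1,\dots,i_k]],j}$ flips the value of the target qubit
labeled $j$; otherwise, $C_{[[i_1,\dots,i_k]],j}$ acts trivially. To implement this gate in the ion trap, we will
make use of an operation $V_{\rm phon}^{(i)}$ that is induced by directing a $\pi$ pulse at ion $i$ tuned to the
transition $|g\rangle_i|1\rangle_{\rm CM} \to |e'\rangle_i|0\rangle_{\rm CM}$; its action is

$$V_{\rm phon}^{(i)} : \left\{ \begin{array}{lcl} |g\rangle_i|0\rangle_{\rm CM} & \longmapsto & |g\rangle_i|0\rangle_{\rm CM} \\ |e\rangle_i|0\rangle_{\rm CM} & \longmapsto & |e\rangle_i|0\rangle_{\rm CM} \\ |g\rangle_i|1\rangle_{\rm CM} & \longmapsto & -i\,|e'\rangle_i|0\rangle_{\rm CM} \\ |e\rangle_i|1\rangle_{\rm CM} & \longmapsto & |e\rangle_i|1\rangle_{\rm CM} \end{array} \right. \qquad (3.9)$$

The pulse has no effect unless the initial state is $|g\rangle_i|1\rangle_{\rm CM}$, in which case the phonon is
absorbed and ion $i$ undergoes a transition to the state $|e'\rangle_i$. We thus see that the
controlled$^k$-NOT gate can be constructed as

$$C_{[[i_1,\dots,i_k]],j} \equiv \left[\tilde U^{(j)}\right]^{-1} \cdot W_{\rm phon}^{(i_1)} \cdot V_{\rm phon}^{(i_2)} \cdots V_{\rm phon}^{(i_k)} \cdot V^{(j)} \cdot V_{\rm phon}^{(i_k)} \cdots V_{\rm phon}^{(i_2)} \cdot W_{\rm phon}^{(i_1)} \cdot \tilde U^{(j)} . \qquad (3.10)$$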



To understand how the construction works, note first of all that if $\epsilon_1 = 0$, no phonon is ever
excited and none of the pulses have any effect. If $\epsilon_1 = \epsilon_2 = \cdots = \epsilon_{m-1} = 1$ and $\epsilon_m = 0$
($m \le k$), then the first $W_{\rm phon}^{(i_1)}$ produces a phonon that is absorbed during the first $V_{\rm phon}^{(i_m)}$
operation, reemitted during the second $V_{\rm phon}^{(i_m)}$ operation, and finally absorbed again during
the second $W_{\rm phon}^{(i_1)}$; the other pulses have no effect. Since each of the four pulses that is
on resonance advances the phase of the state by $\pi/2$, there is no net change of phase. If
$\epsilon_1 = \epsilon_2 = \cdots = \epsilon_k = 1$, then a phonon is excited by the first $W_{\rm phon}^{(i_1)}$, and all of the $V_{\rm phon}^{(i)}$'s
act trivially; hence in this case $C_{[[i_1,\dots,i_k]],j}$ has the same action as $C_{[[i_1]],j}$.
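
The phase bookkeeping in this argument can be traced on basis states with a small classical simulation. The sketch below follows only the pulses between the two $\tilde U^{(j)}$ rotations, so it checks the conditional phase $(-1)^{(\epsilon_1 \wedge \cdots \wedge \epsilon_k)\,\eta}$ rather than the full gate. It assumes that every resonant $\pi$ pulse, in either direction, contributes a factor $-i$; the reverse-direction phases are an assumption, and all function names are hypothetical.

```python
from itertools import product

def w_phon(levels, phonon, phase, i):
    """Pi pulse on |e>|0> <-> |g>|1> at ion i (assumed factor -i in both directions)."""
    if levels[i] == "e" and phonon == 0:
        levels[i], phonon, phase = "g", 1, phase * -1j
    elif levels[i] == "g" and phonon == 1:
        levels[i], phonon, phase = "e", 0, phase * -1j
    return phonon, phase

def v_phon(levels, phonon, phase, i):
    """Pi pulse on |g>|1> <-> |e'>|0> at ion i (assumed factor -i in both directions)."""
    if levels[i] == "g" and phonon == 1:
        levels[i], phonon, phase = "e'", 0, phase * -1j
    elif levels[i] == "e'" and phonon == 0:
        levels[i], phonon, phase = "g", 1, phase * -1j
    return phonon, phase

def v_two_pi(levels, phonon, phase, j):
    """2 pi pulse at ion j: only |g>|1> acquires a factor -1."""
    return -phase if (levels[j] == "g" and phonon == 1) else phase

def inner_sequence(controls, eta):
    """Apply W, V_phon..., V, ...V_phon, W to the basis state (controls, eta)."""
    levels = ["e" if bit else "g" for bit in (*controls, eta)]
    k, j = len(controls), len(controls)
    phonon, phase = 0, 1 + 0j
    phonon, phase = w_phon(levels, phonon, phase, 0)
    for i in range(1, k):
        phonon, phase = v_phon(levels, phonon, phase, i)
    phase = v_two_pi(levels, phonon, phase, j)
    for i in reversed(range(1, k)):
        phonon, phase = v_phon(levels, phonon, phase, i)
    phonon, phase = w_phon(levels, phonon, phase, 0)
    return levels, phonon, phase

if __name__ == "__main__":
    for k in (1, 2, 3):
        for bits in product((0, 1), repeat=k + 1):
            controls, eta = bits[:k], bits[k]
            _, phonon, phase = inner_sequence(controls, eta)
            expected = -1 if (all(controls) and eta) else 1
            assert phonon == 0 and abs(phase - expected) < 1e-12
    print("conditional phase verified for k = 1, 2, 3")
```

The sequence uses $2 + 2(k-1) + 1$ pulses; adding the two single-pulse rotations on the target reproduces the total of $2k + 3$.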

We find, then, that the controlled$^k$-NOT gate ($k = 1, 2, \dots$) can be implemented in the
ion trap with altogether $2k + 3$ laser pulses. These gates are the fundamental operations
that we will use to build the modular exponential function.$^3$



3 In fact, the efficiency of our algorithms could be improved somewhat if we adopted other fun- 
damental gates that can also be simply implemented with the ion trap. Implementations of some 
alternative gates are briefly discussed in Appendix A. 
IV. MODULAR EXPONENTIATION: SOME GENERAL FEATURES



In the next section, we will describe in detail several algorithms for performing modular
exponentiation on a quantum computer. These algorithms evaluate the function

$$f_{N,x}(a) = x^a \ ({\rm mod}\ N) , \qquad (4.1)$$

where $N$ and $x$ are $K$-bit classical numbers (c-numbers) and $a$ is an $L$-qubit quantum number
(q-number). Our main motivation, of course, is that the evaluation of $f_{N,x}$ is the bottleneck
of Shor's factorization algorithm.

Most of our algorithms require a "time" (number of elementary quantum gates) of order
$K^3$, for large $K$. In fact, for asymptotically large $K$, faster algorithms (time of order
$K^2 \log(K) \log\log(K)$) are possible; these take advantage of tricks for performing efficient
multiplication of very large numbers [25]. We will not consider these asymptotically faster
algorithms in any detail here. Fast multiplication requires additional storage space.
Furthermore, because fast multiplication carries a high overhead cost, the advantage in speed
is realized only when the numbers being multiplied are enormous.

We will concentrate instead on honing the efficiency of algorithms requiring $K^3$ time, and
will study the tradeoff of computation time versus storage space for these algorithms. We
will also briefly discuss an algorithm that takes considerably longer ($K^5$ time), but enables
us to compress the storage space further.

Finally, we will describe a "customized" algorithm that is designed to evaluate $f_{N,x}$
in the case $N = 15$, the smallest value of $N$ for which Shor's algorithm can be applied.
Unsurprisingly, this customized algorithm is far more efficient, both in terms of computation
time and memory use, than our general purpose algorithms that apply for any value of $N$
and $x$.



A. The model of computation

A classical computer and a quantum computer: The machine that runs our program
can be envisioned as a quantum computer controlled by a classical computer. The
input that enters the machine consists of both classical data (a string of classical bits) and
quantum data (a string of qubits prepared in a particular quantum state). The classical
data takes a definite fixed value throughout the computation, while for the quantum data
coherent superpositions of different basis states may be considered (and quantum entanglement
of different qubits may occur). The classical computer processes the classical data,
and produces an output that is a program for the quantum computer.

The quantum computer is a quantum gate network of the sort described by Deutsch.
The program prepared by the classical computer is a list of elementary unitary transformations
that are to be applied sequentially to the input state in the quantum register.
(Typically, these elementary transformations act on one, two, or three qubits at a time;
their precise form will vary depending on the design of the quantum computer.) Finally, the
classical computer calls a routine that measures the state of a particular string of qubits,
and the result is recorded. The result of this final measurement is the output of our device.

This division between classical and quantum data is not strictly necessary. Naturally, 
a c-number is just a special case of a q-number, so we could certainly describe the whole
device as a quantum gate network (though of course, our classical computer, unlike the 
quantum network, can perform irreversible operations). However, if we are interested in 
how a practical quantum computer might function, the distinction between the quantum 
computer and the classical computer is vitally important. In view of the difficulty of building 
and operating a quantum computer, if there is any operation performed by our device that is 
intrinsically classical, it will be highly advantageous to assign this operation to the classical 
computer; the quantum computer should be reserved for more important work. (This is 
especially so since it is likely to be quite a while before a quantum computer's "clock speed" 
will approach the speed of contemporary classical computers.) 

Counting operations: Accordingly, when we count the operations that our algorithms 
require, we will be keeping track only of the elementary gates employed by the quantum 
computer, and will not discuss in detail the time required for the classical computer to process 
the classical data. Of course, for our device to be able to perform efficient factorization, the 
time required for the classical computation must be bounded above by a polynomial in K. 
In fact, the classical operations take a time of order $K^3$; thus, the operation of the quantum
computer is likely to dominate the total computation time even for a very long computation.$^4$

In the case of the evaluation of the modular exponential function $f_{N,x}(a)$, the classical
input consists of $N$ and $x$, and the quantum input is $a$ stored in the quantum register; in
addition, the quantum computer will require some additional qubits (initially in the state
$|0\rangle$) that will be used for scratch space. The particular sequence of elementary quantum
gates that are applied to the quantum input will depend on the values of the classical
variables. In particular, the number of operations is actually a complicated function of $N$
and $x$. For this reason, our statements about the number of operations performed by the
quantum computer require clarification.

We will report the number of operations in two forms, which we will call the "worst case" 
and the "average case." Our classical computer will typically compute and read a particular 
classical bit (or sequence of bits) and then decide on the basis of its value what operation 
to instruct the quantum computer to perform next. For example, the quantum computer 
might be instructed to apply a particular elementary gate if the classical bit reads 1, but 
to do nothing if it reads 0. To count the number of operations in the worst case, we will 
assume that the classical control bits always assume the value that maximizes the number 
of operations performed. This worst case counting will usually be a serious overestimate. 
A much more realistic estimate is obtained if we assume that the classical control bits are
random (0 50% of the time and 1 50% of the time). This is how the "average case" estimate
is arrived at.

4 Indeed, one important reason that we insist that the quantum computer is controlled by a classical
computer is that we want to have an honest definition of computational complexity; if it required
an exponentially long classical computation to figure out how to program the quantum computer,
it would be misleading to say that the quantum computer could solve a problem efficiently.
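
The two counting conventions can be illustrated with a toy routine in which one elementary gate is issued for each classical control bit that reads 1. This is only a sketch of the bookkeeping, not one of the algorithms of this paper; the function name and the sample bit string are assumptions.

```python
from typing import Sequence, Tuple

def count_gates(control_bits: Sequence[int]) -> Tuple[int, float, int]:
    """Return (worst case, average case, actual) gate counts for a toy routine
    that applies one gate for every classical control bit equal to 1."""
    worst = len(control_bits)          # every bit assumed to take the costly value
    average = len(control_bits) / 2    # bits assumed random: 1 half of the time
    actual = sum(control_bits)         # what this particular input really costs
    return worst, average, actual

if __name__ == "__main__":
    # Hypothetical 8-bit classical operand, least significant bit first.
    print(count_gates([1, 0, 1, 1, 0, 0, 1, 0]))
```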

The basic machine and the enhanced machine: Our quantum computer can be 
characterized by the elementary quantum gates that are "hard-wired" in the device. We will 
consider two different possibilities. In our "basic machine" the elementary operations will 
be the single-qubit NOT operation, the two-qubit controlled-NOT operation, and the three- 
qubit controlled-controlled-NOT operation (or Toffoli gate). These elementary gates are not 
computationally universal (we cannot construct arbitrary unitary operations by composing 
them), but they will suffice for our purposes; our machine won't need to be able to do
anything else.$^5$ Our "enhanced machine" is equipped with these gates plus two more: a
4-qubit controlled$^3$-NOT gate and a 5-qubit controlled$^4$-NOT gate.

In fact, the extra gates that are standard equipment for the enhanced machine can be 
simulated by the basic machine. However, this simulation is relatively inefficient, so that it 
might be misleading to quote the number of operations required by the basic machine when 
the enhanced machine could actually operate much faster. In particular, Cirac and Zoller
described how to execute a controlled$^k$-NOT ($k \ge 1$) operation using $2k + 3$ laser pulses in
the linear ion trap; thus, e.g., the controlled$^4$-NOT operation can be performed much more
quickly in the ion trap than if it had to be constructed from controlled$^k$-NOT gates with
$k = 0, 1, 2$.

To compare the speed of the basic machine and the enhanced machine, we must assign a
relative cost to the basic operations. We will do so by expressing the number of operations
in the currency of laser pulses under the Cirac-Zoller scheme: 1 pulse for a NOT, 5 for a
controlled-NOT, 7 for a controlled$^2$-NOT, 9 for a controlled$^3$-NOT, and 11 for a
controlled$^4$-NOT. We realize that this measure of speed is very crude. In particular, not all laser pulses
are really equivalent. Different pulses may actually have differing frequencies and differing
durations. Nevertheless, for the purpose of comparing the speed of different algorithms, we
will make the simplifying assumption that the quantum computer has a fixed clock speed,
and administers a laser pulse to an ion in the trap once in each cycle.
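
This cost model is easy to apply mechanically. The sketch below converts a tally of controlled$^k$-NOT gates into laser pulses using the prices just quoted; the function names and the example tally are illustrative assumptions.

```python
from typing import Dict

def pulses_per_gate(k: int) -> int:
    """Laser pulses for a controlled^k-NOT: 1 for a plain NOT (k = 0), otherwise 2k + 3."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1 if k == 0 else 2 * k + 3

def total_pulses(gate_counts: Dict[int, int]) -> int:
    """Total pulse cost of a network, given a map k -> number of controlled^k-NOT gates."""
    return sum(pulses_per_gate(k) * n for k, n in gate_counts.items())

if __name__ == "__main__":
    # Hypothetical network: 2 NOTs, 3 controlled-NOTs, 1 Toffoli gate.
    print(total_pulses({0: 2, 1: 3, 2: 1}))
```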

The case of the (uncontrolled) NOT operation requires special comment. In the Cirac-Zoller
scheme, the single-qubit operations always are $2 \times 2$ unitary operations of determinant
one (the exponential of an off-diagonal $2 \times 2$ Hamiltonian). But the NOT operation has
determinant $(-1)$. A simple solution is to use the operation $i \cdot$(NOT) instead (which does
have determinant 1 and can be executed with a single laser pulse). The overall phase $(i)$ has
no effect on the outcome of the computation. Hence, we take the cost of a NOT operation
to be one pulse.

In counting operations, we assume that the controlled$^k$-NOT operation can be performed
on any set of $k + 1$ qubits in the device. Indeed, a beautiful feature of the Cirac-Zoller
proposal is that the efficiency of the gate implementation is unaffected by the proximity of
the ions. Accordingly, we do not assign any cost to "swapping" the qubits before they enter
a quantum gate.$^6$

5 That is, these operations suffice for evaluation of the modular exponential function. Other gates
will be needed to perform the discrete Fourier transform, as described in Sec. II E.



B. Saving space 

A major challenge in programming a quantum computer is to minimize the "scratchpad
space" that the device requires. We will repeatedly appeal to two basic tricks (both originally
suggested by C. Bennett) to make efficient use of the available space.

Erasing garbage: Suppose that a unitary transformation $F$ is constructed that computes
a (not necessarily invertible) function $f$ of a q-number input $b$. Typically, besides
writing the result $f(b)$ in the output register, the transformation $F$ will also fill a portion of
the scratchpad with some expendable garbage $g(b)$; the action of $F$ can be expressed as

$$F_{\alpha,\beta,\gamma} : |b\rangle_\alpha |0\rangle_\beta |0\rangle_\gamma \longmapsto |b\rangle_\alpha |f(b)\rangle_\beta |g(b)\rangle_\gamma , \qquad (4.2)$$

where $|\cdot\rangle_\alpha$, $|\cdot\rangle_\beta$, $|\cdot\rangle_\gamma$ denote the input, output, and scratch registers, respectively. Before
proceeding to the next step of the computation, we would like to clear $g(b)$ out of the
scratch register, so that the space $|\cdot\rangle_\gamma$ can be reused. To erase the garbage, we invoke a
unitary operation $COPY_{\beta,\delta}$ that copies the contents of $|\cdot\rangle_\beta$ to an additional register $|\cdot\rangle_\delta$, and
then we apply the inverse $F^{-1}$ of the unitary operation $F$. Thus, we have

$$XF_{\alpha,\delta} \equiv F^{-1}_{\alpha,\beta,\gamma} \cdot COPY_{\beta,\delta} \cdot F_{\alpha,\beta,\gamma} : |b\rangle_\alpha |0\rangle_\beta |0\rangle_\gamma |0\rangle_\delta \longmapsto |b\rangle_\alpha |0\rangle_\beta |0\rangle_\gamma |f(b)\rangle_\delta . \qquad (4.3)$$

The composite operation $XF$ uses both of the registers $|\cdot\rangle_\beta$ and $|\cdot\rangle_\gamma$ as scratch space, but it
cleans up after itself. Note that $XF$ preserves the value of $b$ in the input register. This is
necessary, for a general function $f$, if the operation $XF$ is to be invertible.
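
The compute, copy, uncompute pattern can be mimicked on classical basis states. In the sketch below each register is an integer and every operation XORs its result into the target register, which makes each step reversible and makes $F$ its own inverse in this toy model. The functions `f` and `g` are arbitrary placeholders chosen for illustration, not functions used in this paper.

```python
from typing import Tuple

Registers = Tuple[int, int, int, int]   # (alpha, beta, gamma, delta)

def f(b: int) -> int:
    """Hypothetical non-invertible function to be evaluated."""
    return (b * b) % 7

def g(b: int) -> int:
    """Hypothetical garbage left behind in the scratch register."""
    return (3 * b + 1) % 8

def F(regs: Registers) -> Registers:
    """XOR f(alpha) into beta and g(alpha) into gamma; self-inverse in this model."""
    alpha, beta, gamma, delta = regs
    return (alpha, beta ^ f(alpha), gamma ^ g(alpha), delta)

def COPY(regs: Registers) -> Registers:
    """XOR the contents of beta into delta."""
    alpha, beta, gamma, delta = regs
    return (alpha, beta, gamma, delta ^ beta)

def XF(regs: Registers) -> Registers:
    """Compute, copy the result out, then uncompute to clear beta and gamma."""
    return F(COPY(F(regs)))

if __name__ == "__main__":
    print(XF((5, 0, 0, 0)))   # input preserved, scratch cleared, f(5) left in delta
```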

Overwriting invertible functions: We can clear even more scratch space in the special
case where $f$ is an invertible function. In that case, we can also construct another unitary
operation $XFI$ that computes the inverse function $f^{-1}$; that is,

$$XFI_{\alpha,\beta} : |b\rangle_\alpha |0\rangle_\beta \longmapsto |b\rangle_\alpha |f^{-1}(b)\rangle_\beta , \qquad (4.4)$$

or, equivalently,

$$XFI_{\beta,\alpha} : |0\rangle_\alpha |f(b)\rangle_\beta \longmapsto |b\rangle_\alpha |f(b)\rangle_\beta . \qquad (4.5)$$

($XFI$, like $XF$, requires scratchpad space. But since $XFI$, like $XF$, leaves the state of the
scratchpad unchanged, we have suppressed the scratch registers in Eq. (4.4) and Eq. (4.5).)
By composing $XF$ and $XFI^{-1}$, we obtain an operation $OF$ that evaluates the function $f(b)$
and "overwrites" the input $b$ with the result $f(b)$:

$$OF_{\alpha,\beta} \equiv XFI^{-1}_{\beta,\alpha} \cdot XF_{\alpha,\beta} : |b\rangle_\alpha |0\rangle_\beta \longmapsto |0\rangle_\alpha |f(b)\rangle_\beta . \qquad (4.6)$$

(Strictly speaking, this operation does not "overwrite" the input; rather, it erases the input
register $|\cdot\rangle_\alpha$ and writes $f(b)$ in a different register $|\cdot\rangle_\beta$. A genuinely overwriting version of
the evaluation of $f$ can easily be constructed, if desired, by following $OF$ with a unitary
SWAP operation that interchanges the contents of the $|\cdot\rangle_\alpha$ and $|\cdot\rangle_\beta$ registers. Even more
simply, we can merely swap the labels on the registers, a purely classical operation.)

6 For a different type of hardware, such as the device envisioned by Lloyd [26], swapping of qubits
would be required, and the number of elementary operations would be correspondingly larger.

In our algorithms for evaluating the modular exponentiation function, the binary arithmetic
operations that we perform have one classical operand and one quantum operand.
For example, we evaluate the product $y \cdot b$ (mod $N$), where $y$ is a c-number and $b$ is a
q-number. Evaluation of the product can be viewed as the evaluation of a function $f_y(b)$
that is determined by the value of the c-number $y$. Furthermore, since the positive integers
less than $N$ that are relatively prime to $N$ form a group under multiplication, the function
$f_y$ is an invertible function if $\gcd(y, N) = 1$. Thus, for $\gcd(y, N) = 1$, we can (and will) use
the above trick to overwrite the q-number $b$ with a new q-number $y \cdot b$ (mod $N$).
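
For the multiplication example, the overwriting trick can be sketched on classical basis states as follows. Registers are modeled as integers and each operation XORs its result into the target register (so each step is reversible and $XFI$ coincides with its own inverse in this toy model). The names are hypothetical, scratch registers are suppressed, and `pow(y, -1, N)` (Python 3.8 or later) supplies the classical inverse of $y$ mod $N$.

```python
from math import gcd
from typing import Callable, Tuple

def make_overwriting_multiplier(y: int, N: int) -> Callable[[int], Tuple[int, int]]:
    """Build OF for f_y(b) = y*b mod N, acting on the register pair (alpha, beta)."""
    if gcd(y, N) != 1:
        raise ValueError("f_y is invertible only if gcd(y, N) = 1")
    y_inv = pow(y, -1, N)   # classical precomputation of the inverse c-number

    def xf(alpha: int, beta: int) -> Tuple[int, int]:
        # XF: leave the input in alpha, XOR f_y(alpha) into beta.
        return alpha, beta ^ ((y * alpha) % N)

    def xfi_inverse(alpha: int, beta: int) -> Tuple[int, int]:
        # XFI with input register beta and output register alpha; XOR makes it self-inverse.
        return alpha ^ ((y_inv * beta) % N), beta

    def of(b: int) -> Tuple[int, int]:
        # OF = XFI^{-1} . XF : (b, 0) -> (0, y*b mod N)
        return xfi_inverse(*xf(b, 0))

    return of

if __name__ == "__main__":
    overwrite = make_overwriting_multiplier(7, 15)
    print(overwrite(4))   # alpha register cleared, beta holds 7*4 mod 15
```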



C. Multiplexed Adder

The basic arithmetic operation that we will need to perform is addition (mod $N$): we
will evaluate $y + b$ (mod $N$) where $y$ is a c-number and $b$ is a q-number. The most efficient
way that we have found to perform this operation is to build a multiplexed mod $N$ adder.

Suppose that $N$ is a $K$-bit c-number, that $y$ is a $K$-bit c-number less than $N$, and that
$b$ is a $K$-qubit q-number, also less than $N$. Evaluation of $y + b$ (mod $N$) can be regarded as
a function, determined by the c-number $y$, that acts on the q-number $b$. This function can
be described by the "pseudo-code"

if ($N - y > b$) ADD $y$ ,
if ($N - y \le b$) ADD $y - N$ . (4.7)

Our multiplexed adder is designed to evaluate this function. First a comparison is made
to determine if the c-number $N - y$ is greater than the q-number $b$, and the result of the
comparison is stored as a "select qubit." The adder then reads the select qubit, and performs
an "overwriting addition" operation on the q-number $b$, replacing it by either $y + b$ (for
$N - y > b$), or $y + b - N$ (for $N - y \le b$). Finally, the comparison operation is run backwards
to erase the select qubit.

Actually, a slightly modified version of the above pseudo-code is implemented. Since it
is a bit easier to add a positive c-number than a negative one, we choose to add $2^K + y - N$
to $b$ for $N - y \le b$. The $(K+1)$st bit of the sum (which is guaranteed to be 1 in this case),
need not be (and is not) explicitly evaluated by the adder.
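
A classical sketch of the multiplexed adder's arithmetic, acting on a basis state $b$, is given below. It models only the select-then-add logic, including the modified addend $2^K + y - N$ with the $(K+1)$st bit discarded; the reversible comparison that sets and later erases the select qubit is not modeled, and the function name is an assumption.

```python
def multiplexed_add(y: int, b: int, N: int, K: int) -> int:
    """Return y + b (mod N) using one comparison and one K-bit addition."""
    if not (0 <= y < N < 2 ** K and 0 <= b < N):
        raise ValueError("need 0 <= y, b < N < 2^K")
    select = 1 if N - y > b else 0                 # result of the comparison
    addend = y if select else 2 ** K + y - N       # always a non-negative c-number
    # Keep only K bits; when select == 0 the discarded (K+1)st bit is 1.
    return (b + addend) % (2 ** K)

if __name__ == "__main__":
    N, K = 15, 4
    for y in range(N):
        for b in range(N):
            assert multiplexed_add(y, b, N, K) == (y + b) % N
    print(multiplexed_add(9, 11, N, K))
```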
D. Enable bits

Another essential feature of our algorithms is the use of "enable" qubits that control
the arithmetic operations. Our multiplexed adder, for example, incorporates such an enable
qubit. The adder reads the enable qubit, and if it has the value 1, the adder replaces the
input q-number $b$ by the sum $y + b$ (mod $N$) (where $y$ is a c-number). If the enable qubit
has the value 0, the adder leaves the input q-number $b$ unchanged.

Enable qubits provide an efficient way to multiply a q-number by a c-number. A $K$-qubit
q-number $b$ can be expanded in binary notation as

$$b = \sum_{i=0}^{K-1} b_i 2^i , \qquad (4.8)$$

and the product of $b$ and a c-number $y$ can be expressed as

$$b \cdot y \ ({\rm mod}\ N) = \sum_{i=0}^{K-1} b_i \cdot \left[ 2^i y \ ({\rm mod}\ N) \right] . \qquad (4.9)$$

This product can be built by running the pseudo-code:

For $i = 0$ to $K - 1$ , if $b_i = 1$ , ADD $2^i y$ (mod $N$) ; (4.10)

multiplication is thus obtained by performing $K$ conditional mod $N$ additions. Hence our
multiplication routine calls the multiplexed adder $K$ times; in the $i$th call, $b_i$ is the enable
bit that controls the addition.
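
On a classical basis state, the loop of Eq. (4.10) can be sketched as below. The c-numbers $2^i y$ (mod $N$) are computed classically, and each bit $b_i$ plays the role of the enable bit for one mod $N$ addition. The function name is an assumption, and the mod $N$ addition is done with ordinary integer arithmetic rather than a reversible adder.

```python
def multiply_by_additions(b: int, y: int, N: int, K: int) -> int:
    """Return b*y (mod N) using K conditional mod-N additions, one per bit of b."""
    accumulator = 0
    for i in range(K):
        enable = (b >> i) & 1                 # b_i, the enable bit of the i-th call
        addend = (pow(2, i, N) * y) % N       # classical c-number 2^i * y (mod N)
        if enable:
            accumulator = (accumulator + addend) % N
    return accumulator

if __name__ == "__main__":
    print(multiply_by_additions(11, 7, 15, 4))   # 11 * 7 = 77 = 2 (mod 15)
```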

In fact, to compute the modular exponential function as described below, we will need
conditional multiplication; the multiplication routine will have an enable bit of its own. Our
multiplier will replace the q-number $b$ by the product $b \cdot y$ (mod $N$) (where $y$ is a c-number) if
the enable qubit reads 1, and will leave $b$ unchanged if the enable qubit reads 0. To construct
a multiplier with an enable bit, we will need an adder with a pair of enable bits, that is,
an adder that is switched on only when both enable qubits read 1.

The various detailed algorithms that we will describe differ according to how enable
qubits are incorporated into the arithmetic operations. The most straightforward procedure
(and the most efficient, in the linear ion trap device of Cirac and Zoller) is that underlying
the design of our "enhanced machine." We will see that a multiplexed adder can be
constructed from the elementary gates NOT, controlled-NOT and controlled$^2$-NOT. One way to
promote this adder to an adder with two enable bits is to replace each controlled$^k$-NOT by
a controlled$^{(k+2)}$-NOT, where the two enable bits are added to the list of control bits in each
elementary gate. We thus construct a routine that performs (multiplexed) addition when
both enable bits read 1, and does nothing otherwise. The routine is built from elementary
controlled$^k$-NOT gates with $k = 4$ or less.

In fact, it will turn out that we will not really need to add enable bits to the control list of
every gate. But following the above strategy does require controlled$^k$-NOT gates for $k = 0, 1, 2, 3, 4$.
This is how our enhanced machine performs mod $N$ addition with two enable bits (and mod
$N$ multiplication with one enable bit).

Because controlled$^4$-NOT and controlled$^3$-NOT gates are easy to implement on the linear
ion trap, the above procedure is an efficient way to compute the modular exponential function
with an ion trap. However, for a different type of quantum computing hardware, these
elementary gates might not be readily constructed. Therefore, we will also consider a few
other algorithms, which are built from elementary controlled$^k$-NOT gates for only $k = 0, 1, 2$.
These algorithms for our "basic machine" follow the same general design as the algorithm for
the "enhanced machine," except that the controlled$^3$-NOT and the controlled$^4$-NOT gates
are expanded out in terms of the simpler elementary operations. (The various algorithms
for the basic machine differ in the amount of scratch space that they require.)

E. Repeated squaring 

One way to evaluate the modular exponential $x^a \pmod N$ is to multiply by $x$ a total of
$a - 1$ times, but this would be terribly inefficient. Fortunately, there is a well-known trick,
repeated squaring, that speeds up the computation enormously.

If $a$ is an $L$-bit number with the binary expansion $\sum_{i=0}^{L-1} a_i 2^i$, we note that

$$ x^a = x^{\left(\sum_{i=0}^{L-1} a_i 2^i\right)} = \prod_{i=0}^{L-1} \left(x^{2^i}\right)^{a_i} . \tag{4.11} $$

Furthermore, since

$$ x^{2^i} = \left(x^{2^{i-1}}\right)^2 , \tag{4.12} $$

we see that $x^{2^i} \pmod N$ can be computed by squaring $x^{2^{i-1}}$. We conclude that $x^a \pmod N$
can be obtained from at most $2(L-1)$ mod $N$ multiplications (fewer if some of the $a_i$
vanish). If ordinary "grade school" multiplication is used (rather than a fast multiplication
algorithm), this evaluation of $x^a \pmod N$ requires of order $L \cdot K^2$ elementary bit operations
(where $N$ and $x < N$ are $K$-bit numbers). Our algorithms for evaluating $x^a$, where $a$ is an
$L$-bit q-number and $x$ is a $K$-bit c-number, are based on "grade school" multiplication, and
will require of order $L \cdot K^2$ elementary quantum gates.

Since $x$ is a c-number, the "repeated squaring" to evaluate $x^{2^i} \pmod N$ can be performed
by our classical computer. Once these c-numbers are calculated and stored, then $x^a \pmod N$
can be found by running the pseudo-code

$$ \text{For } i = 0 \text{ to } L-1 , \ \ \text{if } a_i = 1 , \ \ \text{MULTIPLY } x^{2^i} \pmod N . \tag{4.13} $$

Thus, the modular exponential function is obtained from $L$ conditional multiplications. It is
for this reason that our mod $N$ multiplier comes equipped with an enable bit. Our modular
exponentiation algorithm calls the mod $N$ multiplier $L$ times; in the $i$th call, $a_{i-1}$ is the
enable bit that controls the multiplication.
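The structure of Eq. (4.13) can be sketched classically as follows. This is only an illustration under the assumption that ordinary integers stand in for the registers; the function name is hypothetical. The table of squares corresponds to the c-numbers prepared by the classical computer, and each pass through the second loop corresponds to one conditional multiplication.

```python
def modexp_by_conditional_multiplies(x: int, a: int, N: int, L: int) -> int:
    """Return x**a mod N for an L-bit exponent a, following Eqs. (4.11)-(4.13)."""
    # Classical step: squares[i] = x^(2^i) mod N, obtained by repeated squaring.
    squares = [x % N]
    for _ in range(1, L):
        squares.append((squares[-1] * squares[-1]) % N)

    result = 1
    for i in range(L):
        if (a >> i) & 1:           # a_i is the enable bit of this multiplication
            result = (result * squares[i]) % N
    return result
```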

V. MODULAR EXPONENTIATION IN DETAIL



A. Notation 

Having described above the central ideas underlying the algorithms, we now proceed to
discuss their detailed implementation. We will be evaluating $x^a \pmod N$, where $N$ is a
$K$-bit c-number, $x$ is a $K$-bit c-number less than $N$, and $a$ is an $L$-bit q-number. For the
factorization algorithm, we will typically choose $L \approx 2K$.

We will use the ket notation $|\cdot\rangle$ to denote the quantum state of a single qubit, a two-level
quantum system. The two basis states of a qubit are denoted $|0\rangle$ and $|1\rangle$. Since most of
the q-numbers that will be manipulated by our computer will be $K$ qubits long, we will
use a shorthand notation for $K$-qubit registers; such registers will be denoted by a ket that
carries a lowercase Greek letter subscript, e.g., $|b\rangle_\alpha$, where $b$ is a $K$-bit string that represents
the number $\sum_{i=0}^{K-1} b_i 2^i$ in binary notation. Single qubits are denoted by kets that carry a
numeral subscript, e.g., $|c\rangle_1$, where $c$ is 0 or 1. Some registers will be $L$ bits long; these will
be decorated by asterisk superscripts, e.g., $|a\rangle^*_\alpha$.

The fundamental operation that our quantum computer performs is the controlled$^k$-NOT
operation. This is the $(k+1)$-qubit quantum gate that acts on a basis according to

$$ C_{[i_1,\dots,i_k],j} : |\epsilon_1\rangle_{i_1} \cdots |\epsilon_k\rangle_{i_k} |\epsilon\rangle_j \longmapsto |\epsilon_1\rangle_{i_1} \cdots |\epsilon_k\rangle_{i_k} |\epsilon \oplus (\epsilon_1 \wedge \cdots \wedge \epsilon_k)\rangle_j . \tag{5.1} $$

Here, each of $\epsilon_1, \dots, \epsilon_k, \epsilon$ takes the value 0 or 1, $\wedge$ denotes the logical AND operation (binary
multiplication) and $\oplus$ denotes the logical XOR operation (binary addition mod 2). Thus,
the gate $C_{[i_1,\dots,i_k],j}$ acts on $k$ "control" qubits labeled $i_1, \dots, i_k$ and on one "target qubit"
labeled $j$. If all $k$ of the control qubits take the value 1, then $C_{[i_1,\dots,i_k],j}$ flips the value
of the target qubit; otherwise, $C_{[i_1,\dots,i_k],j}$ acts trivially. In order to represent our quantum
circuits graphically, we will use Feynman's notation for the controlled$^k$-NOT, shown in Fig.
1. Note that $C^{-1}_{[i_1,\dots,i_k],j} = C_{[i_1,\dots,i_k],j}$, so a computation composed of controlled$^k$-NOT's can
be inverted by simply executing the controlled$^k$-NOT's in the reverse order.

As we explained above, our "basic machine" comes with the NOT, controlled-NOT and
controlled$^2$-NOT gates as standard equipment. Our enhanced machine is equipped with
these fundamental gates and, in addition, the controlled$^3$-NOT and controlled$^4$-NOT gates.
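The action of Eq. (5.1) on computational basis states can be mimicked with a few lines of code. The sketch below is an assumption-laden illustration: it tracks only classical bit values (not superpositions), and the helper name `cnot_k` is hypothetical. It also shows that applying the same gate twice restores the input, which is why a network of such gates is inverted by reversing its order.

```python
from typing import List, Sequence

def cnot_k(bits: List[int], controls: Sequence[int], target: int) -> None:
    """Controlled^k-NOT on a basis state: flip bits[target] iff all controls are 1.

    An empty control list gives a plain NOT (k = 0).
    """
    if all(bits[i] for i in controls):
        bits[target] ^= 1

state = [1, 1, 0]
cnot_k(state, [0, 1], 2)      # controlled^2-NOT: both controls are 1, target flips
after_once = list(state)
cnot_k(state, [0, 1], 2)      # the same gate applied again undoes the flip
```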

B. Addition 

From the controlled$^k$-NOT gates, we can build (reversible) arithmetic operations. The
basic operation in (classical) computer arithmetic is the full adder. Given two addend bits
$a$ and $b$, and an input carry bit $c$, the full adder computes the sum bit

$$ s = a \oplus b \oplus c \tag{5.2} $$

and the output carry bit

$$ c' = (a \wedge b) \vee (c \wedge (a \vee b)) . \tag{5.3} $$

The addition that our quantum computer performs always involves adding a c-number to
a q-number. Thus, we will use two different types of quantum full adders, distinguished by
the value of the classical addend bit. To add the classical bit $a = 0$, we construct

$$ \mathrm{FA}(a=0)_{1,2,3} = C_{[1],2}\, C_{[1,2],3} , \tag{5.4} $$

which acts on a basis according to

$$ \mathrm{FA}(a=0)_{1,2,3} : |b\rangle_1 |c\rangle_2 |0\rangle_3 \longmapsto |b\rangle_1 |b \oplus c\rangle_2 |b \wedge c\rangle_3 . \tag{5.5} $$

Here, the string of controlled$^k$-NOT's defining FA is to be read from right to left; that is,
the gate furthest to the right acts on the kets first. The operation $\mathrm{FA}(a=0)$ is shown
diagrammatically in Fig. 2a, where, in keeping with our convention for operator ordering, the
gate on the right acts first; hence, in the diagram, "time" runs from right to left. To add
the classical bit $a = 1$, we construct

$$ \mathrm{FA}(a=1)_{1,2,3} = C_{[1],2}\, C_{[1,2],3}\, C_2\, C_{[2],3} , \tag{5.6} $$

(see Fig. 2b) which acts as

$$ \mathrm{FA}(a=1)_{1,2,3} : |b\rangle_1 |c\rangle_2 |0\rangle_3 \longmapsto |b\rangle_1 |b \oplus c \oplus 1\rangle_2 |c' = b \vee c\rangle_3 . \tag{5.7} $$

Eqs. (5.4) and (5.6) provide an elementary example that illustrates the concept of a quantum
computer controlled by a classical computer, as discussed in Sec. IV A. The classical
computer reads the value of the classical bit $a$, and then directs the quantum computer to
execute either FA(0) or FA(1).
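The gate sequences of Eqs. (5.4) and (5.6) can be checked against the truth table of Eqs. (5.2) and (5.3) by brute force. The sketch below assumes a classical bit-level model (basis states only) and uses hypothetical helper names; qubits 1, 2, 3 are mapped to list indices 0, 1, 2, and gates are applied in time order, i.e. rightmost factor first.

```python
from itertools import product

def cnot_k(bits, controls, target):
    """Flip bits[target] iff every control bit is 1 (plain NOT for no controls)."""
    if all(bits[i] for i in controls):
        bits[target] ^= 1

def full_adder(a, bits):
    """Apply FA(a) to bits = [b, c, 0] and return the resulting list."""
    if a == 0:
        # FA(0) = C_{[1],2} C_{[1,2],3}
        cnot_k(bits, [0, 1], 2)
        cnot_k(bits, [0], 1)
    else:
        # FA(1) = C_{[1],2} C_{[1,2],3} C_2 C_{[2],3}
        cnot_k(bits, [1], 2)
        cnot_k(bits, [], 1)
        cnot_k(bits, [0, 1], 2)
        cnot_k(bits, [0], 1)
    return bits

def check_full_adder():
    for a, b, c in product((0, 1), repeat=3):
        out = full_adder(a, [b, c, 0])
        # Expected: b preserved, sum bit (5.2), carry bit (5.3).
        assert out == [b, a ^ b ^ c, (a & b) | (c & (a | b))]
    return True
```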

As we have already remarked in Sec. IV C, to perform modular arithmetic efficiently,
we will construct a "multiplexed" full adder. The multiplexed full adder will choose as its
classical addend either one of two classical bits $a_0$ and $a_1$, with the choice dictated by the
value of a "select qubit" $\ell$. That is, if $\ell = 0$, the classical addend will be $a_0$, and if $\ell = 1$
the classical addend will be $a_1$. Thus the multiplexed full adder operation, which we denote
MUXFA', will actually be 4 distinct unitary transformations acting on the qubits of the
quantum computer, depending on the four possible values of the classical bits $(a_0, a_1)$. The
action of MUXFA' is

$$ \mathrm{MUXFA}'(a_0,a_1)_{1,2,3,4} : |\ell\rangle_1 |b\rangle_2 |c\rangle_3 |0\rangle_4 \longmapsto |\ell\rangle_1 |b\rangle_2 |s\rangle_3 |c'\rangle_4 ; \tag{5.8} $$

here $s$ and $c'$ are the sum and carry bits defined in Eqs. (5.2) and (5.3), but where now
$a \equiv (a_1 \wedge \ell) \vee (a_0 \wedge \neg\ell) = a_\ell$.

In fact, for $a_0 = a_1$, the value of the select qubit $\ell$ is irrelevant, and MUXFA' reduces
to the FA operation that we have already constructed:

$$ \begin{aligned}
\mathrm{MUXFA}'(a_0=0, a_1=0)_{1,2,3,4} &\equiv \mathrm{FA}(0)_{2,3,4} , \\
\mathrm{MUXFA}'(a_0=1, a_1=1)_{1,2,3,4} &\equiv \mathrm{FA}(1)_{2,3,4} .
\end{aligned} \tag{5.9} $$

For $a_0 = 0$ and $a_1 = 1$, MUXFA' adds $\ell$, while for $a_0 = 1$ and $a_1 = 0$, it adds $\neg\ell$. This is
achieved by the construction (Fig. 3)

$$ \begin{aligned}
\mathrm{MUXFA}'(a_0=0, a_1=1)_{1,2,3,4} &= C_{[2],3}\, C_{[2,3],4}\, C_{[1],3}\, C_{[1,3],4} , \\
\mathrm{MUXFA}'(a_0=1, a_1=0)_{1,2,3,4} &= C_1\, C_{[2],3}\, C_{[2,3],4}\, C_{[1],3}\, C_{[1,3],4}\, C_1 .
\end{aligned} \tag{5.10} $$

(The second operation is almost the same as the first; the difference is that the qubit $\ell$ is
flipped at the beginning and the end of the operation.)
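As a sanity check of Eqs. (5.9) and (5.10), the four MUXFA' variants can be run on every basis input and compared with the full-adder truth table for the selected addend $a_\ell$. This is a sketch under the same assumptions as before (classical bit values only, hypothetical helper names); qubits 1 to 4 are mapped to indices 0 to 3.

```python
from itertools import product

def cnot_k(bits, controls, target):
    """Flip bits[target] iff every control bit is 1 (plain NOT for no controls)."""
    if all(bits[i] for i in controls):
        bits[target] ^= 1

def muxfa_prime(a0, a1, bits):
    """Apply MUXFA'(a0, a1) to bits = [l, b, c, 0]; gates run rightmost first."""
    if a0 != a1:
        if a0 == 1:
            cnot_k(bits, [], 0)        # C_1: flip the select qubit
        cnot_k(bits, [0, 2], 3)        # C_{[1,3],4}
        cnot_k(bits, [0], 2)           # C_{[1],3}
    elif a0 == 1:
        cnot_k(bits, [2], 3)           # C_{[3],4}
        cnot_k(bits, [], 2)            # C_3
    cnot_k(bits, [1, 2], 3)            # C_{[2,3],4}
    cnot_k(bits, [1], 2)               # C_{[2],3}
    if a0 != a1 and a0 == 1:
        cnot_k(bits, [], 0)            # C_1: restore the select qubit
    return bits

def check_muxfa_prime():
    for a0, a1, l, b, c in product((0, 1), repeat=5):
        a = a1 if l else a0            # the multiplexed classical addend
        out = muxfa_prime(a0, a1, [l, b, c, 0])
        assert out == [l, b, a ^ b ^ c, (a & b) | (c & (a | b))]
    return True
```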

The full adder that we will actually use in our algorithms will be denoted MUXFA
(without the '). As noted in Sec. IV D, to perform multiplication and modular exponentiation,
we will need a (multiplexed) full adder that is controlled by an enable bit, or a string of
enable bits. Thus MUXFA will be an extension of the MUXFA' operation defined above
that incorporates enable bits. If all the enable bits have the value 1, MUXFA acts just like
MUXFA'. But if one or more enable bits are 0, MUXFA will choose the classical addend to
be 0, irrespective of the values of $a_0$ and $a_1$. We will use the symbol $\mathcal{L}$ to denote the full list
of enable bits for the operation. Thus the action of MUXFA can be expressed as

$$ \mathrm{MUXFA}(a_0,a_1)_{[\mathcal{L}],1,2,3,4} : |\ell\rangle_1 |b\rangle_2 |c\rangle_3 |0\rangle_4 \longmapsto |\ell\rangle_1 |b\rangle_2 |s\rangle_3 |c'\rangle_4 ; \tag{5.11} $$

here $s$ and $c'$ are again the sum and carry bits defined in Eqs. (5.2) and (5.3), but this time
$a \equiv \mathcal{L} \wedge (a_1 \wedge \ell \vee a_0 \wedge \neg\ell)$; that is, it is 0 unless all bits of $\mathcal{L}$ take the value 1. The list $\mathcal{L}$
may not include the bits 1, 2, 3, or 4.

In our algorithms, the number of enable bits will be either 1 or 2. Hence, there is a simple
way to construct the MUXFA operation on our "enhanced machine" that comes equipped
with controlled$^3$-NOT and controlled$^4$-NOT gates. To carry out the construction, we note
by inspecting Eqs. (5.9) and (5.10) (or Fig. 3) that $\mathrm{MUXFA}'(a_0,a_1)$ has the form
$\mathrm{MUXFA}'(0,0) \cdot F(a_0,a_1)$; thus, by adding $\mathcal{L}$ to the list of control bits for each of the gates in $F(a_0,a_1)$,
we obtain an operation that acts as MUXFA' when $\mathcal{L}$ is all 1's, and adds 0 otherwise.
Explicitly, we have

$$ \begin{aligned}
\mathrm{MUXFA}(a_0=0, a_1=0)_{[\mathcal{L}],1,2,3,4} &= C_{[2],3}\, C_{[2,3],4} , \\
\mathrm{MUXFA}(a_0=1, a_1=1)_{[\mathcal{L}],1,2,3,4} &= C_{[2],3}\, C_{[2,3],4}\, C_{[\mathcal{L}],3}\, C_{[\mathcal{L},3],4} , \\
\mathrm{MUXFA}(a_0=0, a_1=1)_{[\mathcal{L}],1,2,3,4} &= C_{[2],3}\, C_{[2,3],4}\, C_{[\mathcal{L},1],3}\, C_{[\mathcal{L},1,3],4} , \\
\mathrm{MUXFA}(a_0=1, a_1=0)_{[\mathcal{L}],1,2,3,4} &= C_1\, C_{[2],3}\, C_{[2,3],4}\, C_{[\mathcal{L},1],3}\, C_{[\mathcal{L},1,3],4}\, C_1
\end{aligned} \tag{5.12} $$

(as indicated in Fig. 4). Here, if $\mathcal{L}$ is a list of $j$ bits, then $C_{[\mathcal{L},1,3],4}$, for example, denotes the
controlled$^{(j+2)}$-NOT with $\mathcal{L}$, 1, 3 as its control bits. Evidently, Eq. (5.12) is a construction
of a multiplexed adder with $j$ enable bits in terms of controlled$^k$-NOT gates with $k \le j + 2$.
In particular, we have constructed the adder with two enable bits that we will need, using
the gates that are available on our enhanced machine.
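The enabled adder of Eq. (5.12) can be verified exhaustively for two enable bits. The following is a sketch, assuming a classical bit-level model and hypothetical function names; the enable bits occupy indices 0 and 1, and qubits 1 to 4 occupy indices 2 to 5.

```python
from itertools import product

def cnot_k(bits, controls, target):
    """Flip bits[target] iff every control bit is 1 (plain NOT for no controls)."""
    if all(bits[i] for i in controls):
        bits[target] ^= 1

def muxfa(bits, a0, a1, en, sel, b, c, cout):
    """Apply MUXFA(a0, a1) of Eq. (5.12); `en` lists the enable-bit indices."""
    if a0 != a1:
        if a0 == 1:
            cnot_k(bits, [], sel)              # C_1
        cnot_k(bits, en + [sel, c], cout)      # C_{[L,1,3],4}
        cnot_k(bits, en + [sel], c)            # C_{[L,1],3}
    elif a0 == 1:
        cnot_k(bits, en + [c], cout)           # C_{[L,3],4}
        cnot_k(bits, en, c)                    # C_{[L],3}
    cnot_k(bits, [b, c], cout)                 # C_{[2,3],4}
    cnot_k(bits, [b], c)                       # C_{[2],3}
    if a0 != a1 and a0 == 1:
        cnot_k(bits, [], sel)                  # C_1

def check_muxfa():
    for a0, a1, e1, e2, l, b, c in product((0, 1), repeat=7):
        bits = [e1, e2, l, b, c, 0]
        muxfa(bits, a0, a1, en=[0, 1], sel=2, b=3, c=4, cout=5)
        a = (e1 & e2) & (a1 if l else a0)      # addend is 0 unless fully enabled
        assert bits == [e1, e2, l, b, a ^ b ^ c, (a & b) | (c & (a | b))]
    return True
```

With two enable bits the largest gate in this sketch has four controls, matching the controlled$^4$-NOT available on the enhanced machine.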

The reader who is impatient to see how our algorithms work in detail is encouraged to
proceed now to the next subsection of the paper. But first, we would like to dispel any notion
that the algorithms make essential use of the elementary controlled$^3$-NOT and controlled$^4$-NOT
gates. So let us now consider how the construction of the MUXFA operation can be
modified so that it can be carried out on the basic machine (which is limited to controlled$^k$-NOT
gates with $k \le 2$). The simplest such modification requires an extra bit (or two) of
scratch space. Suppose we want to build a MUXFA'' operation with a single enable bit,
without using the controlled$^3$-NOT gate. For $a_0 = a_1$, the construction in Eq. (5.12) need
not be modified; in those cases, the action of the operation is independent of the select bit
$\ell$, and therefore no controlled$^3$-NOT gates were needed. For $a_0 \ne a_1$, controlled$^3$-NOT gates
are used, but we note that the control string of these controlled$^3$-NOT gates includes both
the enable bit and the select bit. Hence, we can easily eliminate the controlled$^3$-NOT gate
$C_{[\mathcal{L},1,3],4}$ by using a controlled$^2$-NOT to compute (and store) the logical AND ($\mathcal{L} \wedge \ell$) of the
enable and select bits, and then replacing the controlled$^3$-NOT by a controlled$^2$-NOT that
has the scratch bit as one of its control bits. Another controlled$^2$-NOT at the end of the
operation clears the scratch bit. In an equation:



$$ \begin{aligned}
\mathrm{MUXFA}''(a_0=0, a_1=1)_{[\mathcal{L}],1,2,3,4,5} &= C_{[\mathcal{L},1],5}\, C_{[2],3}\, C_{[2,3],4}\, C_{[5],3}\, C_{[5,3],4}\, C_{[\mathcal{L},1],5} , \\
\mathrm{MUXFA}''(a_0=1, a_1=0)_{[\mathcal{L}],1,2,3,4,5} &= C_1\, C_{[\mathcal{L},1],5}\, C_{[2],3}\, C_{[2,3],4}\, C_{[5],3}\, C_{[5,3],4}\, C_{[\mathcal{L},1],5}\, C_1 ;
\end{aligned} \tag{5.13} $$

as illustrated in Fig. 5. If the scratch bit $|\cdot\rangle_5$ starts out in the state $|0\rangle_5$, MUXFA'' has the
same action as MUXFA, and it returns the scratch bit to the state $|0\rangle_5$ at the end. By
adding yet another bit of scratch space, and another controlled$^2$-NOT at the beginning and
the end, we easily construct a MUXFA operation with two enable bits.

At the alternative cost of slightly increasing the number of elementary gates, the extra
scratch bit in MUXFA'' can be eliminated. That is, an operation with precisely the same
action as MUXFA can be constructed from controlled$^k$-NOT gates with $k \le 2$, and without
the extra scratch bit. This construction uses an idea of Barenco et al. [28], that a controlled$^k$-NOT
can be constructed from two controlled$^{(k-1)}$-NOT's and two controlled$^2$-NOT's (for any
$k \ge 3$) by employing an extra bit. This idea differs from the construction described above,
because the extra bit, unlike our scratch bit, is not required to be preset to 0 at the beginning
of the operation. Hence, to construct the $C_{[\mathcal{L},1,3],4}$ gate needed in MUXFA, we can use $|b\rangle_2$
as the extra bit. That is, we may use the Barenco et al. identity

$$ C_{[\mathcal{L},1,3],4} = C_{[2,3],4}\, C_{[\mathcal{L},1],2}\, C_{[2,3],4}\, C_{[\mathcal{L},1],2} \tag{5.14} $$

to obtain, say,

$$ \mathrm{MUXFA}'''(a_0=0, a_1=1)_{[\mathcal{L}],1,2,3,4} = C_{[2],3}\, C_{[2,3],4}\, C_{[\mathcal{L},1],3}\, C_{[2,3],4}\, C_{[\mathcal{L},1],2}\, C_{[2,3],4}\, C_{[\mathcal{L},1],2} \tag{5.15} $$

(as in Fig. 6). This identity actually works irrespective of the number of bits in the enable
string $\mathcal{L}$, but we have succeeded in reducing the elementary gates to those that can be
implemented on the basic machine only in the case of MUXFA with a single enable bit. To
reduce the MUXFA operation with two enable bits to the basic gates, we can apply the
same trick again, replacing each controlled$^3$-NOT by four controlled$^2$-NOT's (using, say, the
4th bit as the extra bit required by the Barenco et al. construction). We will refer to the
resulting operation as MUXFA''''.
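The identity of Eq. (5.14) is easy to confirm by enumeration for a single enable bit. The sketch below assumes classical bit values and hypothetical names; indices 0 to 4 hold the enable bit, the select bit (qubit 1), the borrowed bit $b$ (qubit 2), the bit $c$ (qubit 3), and the target (qubit 4). Note that the borrowed bit is allowed to start in either state and is returned unchanged.

```python
from itertools import product

def cnot_k(bits, controls, target):
    """Flip bits[target] iff every control bit is 1."""
    if all(bits[i] for i in controls):
        bits[target] ^= 1

def check_barenco_identity():
    for bits in product((0, 1), repeat=5):
        lhs, rhs = list(bits), list(bits)
        # Left side of Eq. (5.14): one controlled^3-NOT.
        cnot_k(lhs, [0, 1, 3], 4)
        # Right side, in time order (rightmost factor first): four controlled^2-NOT's.
        for controls, target in (([0, 1], 2), ([2, 3], 4), ([0, 1], 2), ([2, 3], 4)):
            cnot_k(rhs, controls, target)
        assert lhs == rhs
    return True
```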

Aside from the multiplexed full adder MUXFA, we will also use a multiplexed half adder
which we will call MUXHA. The half adder does not compute the final carry bit; it acts
according to

$$ \mathrm{MUXHA}(a_0,a_1)_{[\mathcal{L}],1,2,3} : |\ell\rangle_1 |b\rangle_2 |c\rangle_3 \longmapsto |\ell\rangle_1 |b\rangle_2 |s\rangle_3 , \tag{5.16} $$

where $s = a \oplus b \oplus c$, and $a = \mathcal{L} \wedge (a_1 \wedge \ell \vee a_0 \wedge \neg\ell)$. (Note that, since the input qubit $b$
is preserved, the final carry bit is not needed to ensure the reversibility of the operation.)
MUXHA is constructed from elementary gates according to

$$ \begin{aligned}
\mathrm{MUXHA}(a_0=0, a_1=0)_{[\mathcal{L}],1,2,3} &= C_{[2],3} , \\
\mathrm{MUXHA}(a_0=1, a_1=1)_{[\mathcal{L}],1,2,3} &= C_{[2],3}\, C_{[\mathcal{L}],3} , \\
\mathrm{MUXHA}(a_0=0, a_1=1)_{[\mathcal{L}],1,2,3} &= C_{[2],3}\, C_{[\mathcal{L},1],3} , \\
\mathrm{MUXHA}(a_0=1, a_1=0)_{[\mathcal{L}],1,2,3} &= C_1\, C_{[2],3}\, C_{[\mathcal{L},1],3}\, C_1
\end{aligned} \tag{5.17} $$

(see Fig. 7). For a single enable bit, this construction can be carried out on the basic
machine. If there are two enable bits, the controlled$^3$-NOT's can be expanded in terms of
controlled$^2$-NOT's as described above.

A multiplexed $K$-bit adder is easily constructed by chaining together $(K-1)$ MUXFA
gates and one MUXHA gate, as shown in Fig. 8. This operation, which we denote MADD,
depends on a pair of $K$-bit c-numbers $a$ and $a'$. MADD (if all enable bits read 1) adds
either $a$ or $a'$ to the $K$-bit q-number $b$, with the choice determined by the value of the select
bit $\ell$. (That is, it adds $a$ for $\ell = 0$ and adds $a'$ for $\ell = 1$.) Thus, MADD acts according to:

$$ \mathrm{MADD}(a,a')_{[\mathcal{L}],\beta,\gamma,1} : |b\rangle_\beta |0\rangle_\gamma |\ell\rangle_1 \longmapsto |b\rangle_\beta |s\rangle_\gamma |\ell\rangle_1 ; \tag{5.18} $$

where

$$ s = \left[ b + \mathcal{L} \wedge (a' \wedge \ell \vee a \wedge \neg\ell) \right]_{\bmod 2^K} . \tag{5.19} $$

The $[\,\cdot\,]_{\bmod 2^K}$ notation in Eq. (5.19) indicates that the sum $s$ residing in $|\cdot\rangle_\gamma$ at the end of
the operation is only $K$ bits long; MADD does not compute the final carry bit. Since we
will not need the final bit to perform addition mod $N$, we save a few elementary operations
by not bothering to compute it. (The MADD operation is invertible nonetheless.)

Transcribed as an equation, Fig. 8 says that MADD is constructed as

$$ \mathrm{MADD}(a,a')_{[\mathcal{L}],\beta,\gamma,1} = \mathrm{MUXHA}(a_{K-1},a'_{K-1})_{[\mathcal{L}],1,\beta_{K-1},\gamma_{K-1}} \cdot \prod_{i=0}^{K-2} \mathrm{MUXFA}(a_i,a'_i)_{[\mathcal{L}],1,\beta_i,\gamma_i,\gamma_{i+1}} . \tag{5.20} $$

In Eq. (5.20) the order of the operations is to be read from right to left; hence the product has the
operator with $i = 0$ furthest to the right (acting first). Each MUXFA operation reads the
enable string $\mathcal{L}$, and, if enabled, performs an elementary (multiplexed) addition, passing its
final carry bit on to the next operation in the chain. The two classical bits used by the
$j$th MUXFA are $a_j$ and $a'_j$, the $j$th bits of the c-numbers $a$ and $a'$. The final elementary
addition is performed by MUXHA rather than MUXFA, because the final carry bit will
not be needed.
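The chain of Eq. (5.20) can be simulated at the level of individual gates and compared with Eq. (5.19). The following sketch makes several assumptions for illustration: basis states only, a single enable bit, hypothetical function names, and a particular register layout (enable bit at index 0, select bit at index 1, then the $K$ bits of $\beta$, then the $K$ bits of $\gamma$).

```python
def cnot_k(bits, controls, target):
    """Flip bits[target] iff every control bit is 1 (plain NOT for no controls)."""
    if all(bits[i] for i in controls):
        bits[target] ^= 1

def muxfa(bits, a0, a1, en, sel, b, c, cout):
    """Enabled multiplexed full adder, Eq. (5.12), gates in time order."""
    flip = a0 != a1 and a0 == 1
    if flip:
        cnot_k(bits, [], sel)
    if a0 != a1:
        cnot_k(bits, en + [sel, c], cout)
        cnot_k(bits, en + [sel], c)
    elif a0 == 1:
        cnot_k(bits, en + [c], cout)
        cnot_k(bits, en, c)
    cnot_k(bits, [b, c], cout)
    cnot_k(bits, [b], c)
    if flip:
        cnot_k(bits, [], sel)

def muxha(bits, a0, a1, en, sel, b, c):
    """Enabled multiplexed half adder, Eq. (5.17): no carry-out is computed."""
    flip = a0 != a1 and a0 == 1
    if flip:
        cnot_k(bits, [], sel)
    if a0 != a1:
        cnot_k(bits, en + [sel], c)
    elif a0 == 1:
        cnot_k(bits, en, c)
    cnot_k(bits, [b], c)
    if flip:
        cnot_k(bits, [], sel)

def madd(a, a_prime, K, enable, select, b):
    """Run the MADD chain of Eq. (5.20) and return the K-bit sum left in gamma."""
    bits = [enable, select] + [(b >> i) & 1 for i in range(K)] + [0] * K
    beta = lambda i: 2 + i
    gamma = lambda i: 2 + K + i
    for i in range(K - 1):                       # i = 0 acts first
        muxfa(bits, (a >> i) & 1, (a_prime >> i) & 1, [0], 1,
              beta(i), gamma(i), gamma(i + 1))
    top = K - 1
    muxha(bits, (a >> top) & 1, (a_prime >> top) & 1, [0], 1, beta(top), gamma(top))
    return sum(bits[gamma(i)] << i for i in range(K))
```

For example, `madd(1, 3, 2, 1, 0, 3)` adds $a = 1$ to $b = 3$ with $K = 2$ and returns the sum with the final carry dropped.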

C. Comparison



In our algorithms, we need to perform addition mod $N$ of a c-number $a$ and a q-number $b$.
An important step in modular addition is comparison: we must find out whether $a + b \ge N$.
Thus, our next task is to devise a unitary operation that compares a c-number and a
q-number. This operation should, say, flip a target bit if the c-number is greater than the
q-number, and leave the target bit alone otherwise.

A conceptually simple way to compare a $K$-bit c-number $a$ and a $K$-bit q-number $b$ is
to devise an adder that computes the sum of the c-number $2^K - 1 - a$ and the q-number
$b$. Since the sum is less than $2^K$ only for $a \ge b$, the final carry bit of the sum records the
outcome of the comparison. This method works fine, but we will use a different method that
turns out to be slightly more efficient.

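The idea of our method is that we can scan $a$ and $b$ from left to right, and compare them
one bit at a time. If $a_{K-1}$ and $b_{K-1}$ are different, then the outcome of the comparison is
determined and we are done. If $a_{K-1}$ and $b_{K-1}$ are the same, we proceed to examine $a_{K-2}$
and $b_{K-2}$ and repeat the procedure, etc. We can represent this routine in pseudo-code as

$$ \begin{array}{lll}
\text{if } a_{K-1} = 0 : & b_{K-1} = 0 \Rightarrow \text{continue} ; & b_{K-1} = 1 \Rightarrow b \ge a \ \ \text{END} \\
\text{if } a_{K-1} = 1 : & b_{K-1} = 0 \Rightarrow b < a \ \ \text{END} ; & b_{K-1} = 1 \Rightarrow \text{continue} \\
\text{if } a_{K-2} = 0 : & b_{K-2} = 0 \Rightarrow \text{continue} ; & b_{K-2} = 1 \Rightarrow b \ge a \ \ \text{END} \\
\text{if } a_{K-2} = 1 : & b_{K-2} = 0 \Rightarrow b < a \ \ \text{END} ; & b_{K-2} = 1 \Rightarrow \text{continue} \\
\quad \vdots & & \\
\text{if } a_0 = 0 : & b \ge a \ \ \text{END} & \\
\text{if } a_0 = 1 : & b_0 = 0 \Rightarrow b < a \ \ \text{END} ; & b_0 = 1 \Rightarrow b \ge a \ \ \text{END}
\end{array} \tag{5.21} $$

To implement this pseudo-code as a unitary transformation, we will use enable qubits in
each step of the comparison. Once the comparison has "ended," all subsequent enable bits
will be switched off, so that the subsequent operations will have no effect on the outcome.
Unfortunately, to implement this strategy reversibly, we seem to need a new enable bit for
(almost) every step of the comparison, so the comparison operation will fill $K - 1$ bits of
scratch space with junk. This need for scratch space is not really a big deal, though. We
can immediately clear the scratch space, which will be required for subsequent use anyway.

As in our construction of the adder, our comparison operation is a sequence of elementary
quantum gates that depends on the value of the $K$-bit c-number $a$. We will call the operation
LT (for "less than"). Its action is

$$ \mathrm{LT}(a)_{\beta,1,\hat\gamma} : |b\rangle_\beta |0\rangle_1 |0\rangle_{\hat\gamma} \longmapsto |b'\rangle_\beta |\ell\rangle_1 |\text{junk}\rangle_{\hat\gamma} , \tag{5.22} $$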
where $\ell$ takes the value 1 for $b < a$ and the value 0 for $b \ge a$. Here the register labeled $|\cdot\rangle_{\hat\gamma}$ is
actually $K-1$ rather than $K$ qubits long. The junk that fills this register has a complicated
dependence on $a$ and $b$, the details of which are not of interest. In passing, the LT operation
also modifies the q-number $b$, replacing it by $b'$. ($b'$ is almost the negation of $b$, that is, $b$ with all of
its qubits flipped, except that $b_0$ is not flipped unless $a_0 = 1$.) We need not be concerned
about this either, as we will soon run the LT operation backwards to repair the damage.

The LT operation is constructed from elementary gates as:

$$ \begin{aligned}
\mathrm{LT}(a)_{\beta,1,\hat\gamma} ={}& \left\{ \text{if } (a_0 = 1) \ \ C_{[\hat\gamma_0,\beta_0],1}\, C_{\beta_0} \right\} \\
& \cdot \prod_{i=1}^{K-2} \left\{ \begin{array}{ll}
\text{if } (a_i = 0) & C_{[\hat\gamma_i,\beta_i],\hat\gamma_{i-1}}\, C_{\beta_i} \\
\text{if } (a_i = 1) & C_{[\hat\gamma_i,\beta_i],1}\, C_{\beta_i}\, C_{[\hat\gamma_i,\beta_i],\hat\gamma_{i-1}}
\end{array} \right\} \\
& \cdot \left\{ \begin{array}{ll}
\text{if } (a_{K-1} = 0) & C_{[\beta_{K-1}],\hat\gamma_{K-2}}\, C_{\beta_{K-1}} \\
\text{if } (a_{K-1} = 1) & C_{[\beta_{K-1}],1}\, C_{\beta_{K-1}}\, C_{[\beta_{K-1}],\hat\gamma_{K-2}}
\end{array} \right\} .
\end{aligned} \tag{5.23} $$

As usual, the gates furthest to the right act first. The product is ordered so that the operator
with $i = 1$ is furthest to the left (and hence acts last). The first step of the LT algorithm is
different from the rest because it
is not conditioned on the value of any "switch." For each of the $K - 2$ intermediate steps
($i = K-2, K-3, \dots, 1$), the switch $\hat\gamma_i$ is read, and if the switch is on, the comparison of $a_i$
and $b_i$ is carried out. If $a_i \ne b_i$, then the outcome of the comparison of $a$ and $b$ is settled; the
value of $\ell$ is adjusted accordingly, and the switch $\hat\gamma_{i-1}$ is not turned on. If $a_i = b_i$, then $\hat\gamma_{i-1}$
is switched on, so that the comparison can continue. Finally, the last step can be simplified,
as in Eq. (5.23).
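The bitwise comparison described above can be exercised with a small gate-level simulation. This is a sketch under stated assumptions: only classical basis states are tracked, the function names are hypothetical, $K \ge 2$, and the register layout (the $K$ bits of $b$ first, then the comparison bit, then the $K-1$ switch bits) is chosen just for this example. The switch fed into step $i$ is the one written by step $i+1$.

```python
def cnot_k(bits, controls, target):
    """Flip bits[target] iff every control bit is 1 (plain NOT for no controls)."""
    if all(bits[i] for i in controls):
        bits[target] ^= 1

def lt(a, b, K):
    """Run the LT(a) gate sequence on |b>|0>|0...0> and return the comparison bit."""
    assert K >= 2
    beta = list(range(K))                      # b_i lives at index i
    ell = K                                    # comparison (select) bit
    sw = [K + 1 + i for i in range(K - 1)]     # switches, K-1 scratch bits
    bits = [(b >> i) & 1 for i in range(K)] + [0] * K
    top = K - 1
    # First step: not conditioned on any switch.
    if (a >> top) & 1 == 0:
        cnot_k(bits, [], beta[top])
        cnot_k(bits, [beta[top]], sw[top - 1])
    else:
        cnot_k(bits, [beta[top]], sw[top - 1])
        cnot_k(bits, [], beta[top])
        cnot_k(bits, [beta[top]], ell)
    # Intermediate steps i = K-2, ..., 1: each reads the switch sw[i].
    for i in range(K - 2, 0, -1):
        if (a >> i) & 1 == 0:
            cnot_k(bits, [], beta[i])
            cnot_k(bits, [sw[i], beta[i]], sw[i - 1])
        else:
            cnot_k(bits, [sw[i], beta[i]], sw[i - 1])
            cnot_k(bits, [], beta[i])
            cnot_k(bits, [sw[i], beta[i]], ell)
    # Last step: only a_0 = 1 can still give b < a.
    if a & 1:
        cnot_k(bits, [], beta[0])
        cnot_k(bits, [sw[0], beta[0]], ell)
    return bits[ell]
```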

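We can now easily construct a comparison operator that cleans up the scratch space, and
restores the original value of $b$, by using the trick mentioned in Sec. IV B: we run LT, copy
the outcome $\ell$ of the comparison, and then run LT in reverse. We will actually want our
comparison operator to be enabled by a string $\mathcal{L}$, which we can achieve by controlling the
copy operation with $\mathcal{L}$. The resulting operator, which we call XLT, flips the target qubit if
$b < a$:

$$ \begin{aligned}
\mathrm{XLT}(a)_{[\mathcal{L}],\beta,1,2,\hat\gamma} &= \mathrm{LT}(a)^{-1}_{\beta,2,\hat\gamma}\, C_{[\mathcal{L},2],1}\, \mathrm{LT}(a)_{\beta,2,\hat\gamma} : \\
& |b\rangle_\beta |x\rangle_1 |0\rangle_2 |0\rangle_{\hat\gamma} \longmapsto |b\rangle_\beta |x \oplus y\rangle_1 |0\rangle_2 |0\rangle_{\hat\gamma} ,
\end{aligned} \tag{5.24} $$

where, when the operator is enabled, $y$ is 1 if $b < a$ and 0 otherwise. We recall that the register $|\cdot\rangle_{\hat\gamma}$ is actually $K-1$ qubits
long, so the XLT routine requires $K$ qubits of scratch space.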



D. Addition mod N 



Now that we have constructed a multiplexed adder and a comparison operator, we can
easily perform addition mod $N$. First XLT compares the c-number $N - a$ with the q-number
$b$, and switches on the select bit $\ell$ if $a + b < N$. Then the multiplexed adder adds either $a$
(for $a + b < N$) or $2^K + a - N$ (for $a + b \ge N$) to $b$. Note that $2^K + a - N$ is guaranteed to be
positive ($N$ and $a$ are $K$-bit numbers with $a < N$). In the case where $2^K + a - N$ is added,
the desired result $a + b \pmod N$ is obtained by subtracting $2^K$ from the sum; that is, by
dropping the final carry bit. That is why our MADD routine does not bother to compute
this final bit.
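The arithmetic behind this trick can be checked classically. The sketch below is only an illustration with hypothetical names: it models the comparison and the multiplexed addition with ordinary integers, and shows that discarding the carry (reduction mod $2^K$) yields $a + b \pmod N$ whenever $a, b < N < 2^K$.

```python
def addn_classical(a: int, b: int, N: int, K: int):
    """Classical model of mod-N addition via a comparison and one K-bit addition."""
    assert 0 <= a < N and 0 <= b < N and N < (1 << K)
    ell = 1 if b < N - a else 0                  # outcome of comparing b with N - a
    addend = a if ell else (1 << K) + a - N      # addend selected by the bit ell
    total = (b + addend) % (1 << K)              # the final carry bit is dropped
    return ell, total
```

For instance, with $K = 4$, $N = 13$, $a = 9$ and $b = 7$, the selected addend is $16 + 9 - 13 = 12$, and $(7 + 12) \bmod 16 = 3 = 16 \bmod 13$.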

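We call our mod $N$ addition routine ADDN; it acts as

$$ \begin{aligned}
\mathrm{ADDN}(a,N)_{[\mathcal{L}],\beta,1,\gamma} :{}& |b\rangle_\beta |0\rangle_1 |0\rangle_\gamma \\
& \longmapsto |b\rangle_\beta |\ell \equiv \mathcal{L} \wedge (a + b < N)\rangle_1 |b + \mathcal{L} \wedge a \pmod N\rangle_\gamma .
\end{aligned} \tag{5.25} $$

(Here the notation $\ell \equiv \mathcal{L} \wedge (a + b < N)$ means that the qubit $\ell$ reads 1 if the statement $\mathcal{L} \wedge
(a + b < N)$ is true and reads 0 otherwise.) If enabled, this operator computes $a + b \pmod N$;
if not, it merely copies $b$. ADDN is constructed from MADD and XLT according to

$$ \mathrm{ADDN}(a,N)_{[\mathcal{L}],\beta,1,\gamma} = \mathrm{MADD}(2^K + a - N, a)_{[\mathcal{L}],\beta,\gamma,1} \cdot \mathrm{XLT}(N - a)_{[\mathcal{L}],\beta,1,\gamma} \tag{5.26} $$

(see Fig. 9). Note that XLT uses and then clears the $K$ bits of scratch space in the register
$|\cdot\rangle_\gamma$, before MADD writes the mod $N$ sum there.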

The ADDN routine can be viewed as the computation of an invertible function (specified
by the c-numbers $a$ and $N$) of the q-number $b$. (Note that the output of this function is
the sum $a + b \pmod N$ and the comparison bit $\ell$; the comparison bit is needed to ensure
invertibility, since it is possible that $b \ge N$.) Thus, we can use the trick mentioned in Sec.
IV B to devise an "overwriting" version of this function. Actually, since we will not need to
know the value of $\ell$ (or worry about the case $b \ge N$), we can save a qubit by modifying the
trick slightly.

The overwriting addition routine OADDN is constructed as

$$ \mathrm{OADDN}(a,N)_{[\mathcal{L}],\beta,1,\gamma} = \mathrm{SWAP}_{\beta,\gamma}\, \mathrm{ADDN}^{-1}(N - a, N)_{[\mathcal{L}],\gamma,1,\beta} \cdot C_{[\mathcal{L}],1}\, \mathrm{ADDN}(a,N)_{[\mathcal{L}],\beta,1,\gamma} \tag{5.27} $$

(see Fig. 10), and acts (for $b < N$) according to

$$ \begin{aligned}
\mathrm{OADDN}(a,N)_{[\mathcal{L}],\beta,1,\gamma} :{}& |b\rangle_\beta |0\rangle_1 |0\rangle_\gamma \\
& \longmapsto |b\rangle_\beta |\ell \equiv \mathcal{L} \wedge (a + b < N)\rangle_1 |b + \mathcal{L} \wedge a \pmod N\rangle_\gamma \\
& \longmapsto |b\rangle_\beta |\ell \equiv \mathcal{L} \wedge (a + b \ge N)\rangle_1 |b + \mathcal{L} \wedge a \pmod N\rangle_\gamma \\
& \longmapsto |0\rangle_\beta |0\rangle_1 |b + \mathcal{L} \wedge a \pmod N\rangle_\gamma \\
& \longmapsto |b + \mathcal{L} \wedge a \pmod N\rangle_\beta |0\rangle_1 |0\rangle_\gamma .
\end{aligned} \tag{5.28} $$

Here, in Eq. (5.28), we have indicated the effect of each of the successive operations in Eq.
(5.27). We can easily verify that applying $\mathrm{ADDN}(N - a, N)_{[\mathcal{L}],\gamma,1,\beta}$ to the second-to-last line
of Eq. (5.28) yields the preceding line. If the enable string $\mathcal{L}$ is false, the verification is trivial,
for $b < N$. (It was in order to ensure that this would work that we needed the XLT operation
to be enabled by $\mathcal{L}$.) When $\mathcal{L}$ is true, we need only observe that $N - a + [b + a \pmod N] < N$
if and only if $a + b \ge N$ (assuming that $b < N$).

$^7$ Thus, if ADDN is not enabled, Eq. (5.25) is valid only for $b < N$. We assume here and in
the following that $b < N$ is satisfied; in the evaluation of the modular exponential function, our
operators will always be applied to q-numbers that satisfy this condition.
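The uncomputation step can be illustrated with a classical model of the register contents. In the sketch below (hypothetical names, integers in place of qubit registers, a single Boolean enable), `addn` returns the triple left by ADDN, and `oaddn` checks that running ADDN with the c-number $N - a$ forward on the sum reproduces the state that is to be erased, which is exactly the condition for the inverse to clear it.

```python
def addn(a: int, b: int, N: int, enable: int = 1):
    """Classical model of ADDN: returns (b, ell, b + enable*a mod N)."""
    ell = enable & (1 if a + b < N else 0)
    return b, ell, (b + enable * a) % N

def oaddn(a: int, b: int, N: int, enable: int = 1):
    """Classical model of the overwriting adder; returns (beta, ell, gamma)."""
    assert 1 <= a < N and 0 <= b < N
    beta, ell, gamma = addn(a, b, N, enable)
    ell ^= enable                                # flip of the comparison bit, if enabled
    # ADDN(N - a, N) acting on (gamma, 0, 0) must give (gamma, ell, beta),
    # so its inverse returns the comparison bit and the old register to 0.
    assert addn(N - a, gamma, N, enable) == (gamma, ell, beta)
    beta, ell = 0, 0
    beta, gamma = gamma, beta                    # SWAP is a mere relabeling
    return beta, ell, gamma
```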

The SWAP operation in Eq. (5.27) is not a genuine quantum operation at all; it is a
mere relabeling of the $|\cdot\rangle_\beta$ and $|\cdot\rangle_\gamma$ registers that is performed by the classical computer. We
have included the SWAP because it will be convenient for the sum to be stored in the $|\cdot\rangle_\beta$
register when we chain together OADDN's to construct a multiplication operator. We see
that OADDN uses and then clears $K + 1$ qubits of scratch space.



E. Multiplication mod N 

We have already explained in Sec. IV D how mod $N$ multiplication can be constructed
from conditional mod $N$ addition. Implementing the strategy described there, we can construct
a conditional multiplication operator MULN that acts according to

$$ \begin{aligned}
\mathrm{MULN}(a,N)_{[\mathcal{L}],\beta,\gamma,1,\delta} :{}& |b\rangle_\beta |0\rangle_\gamma |0\rangle_1 |0\rangle_\delta \\
& \longmapsto |b\rangle_\beta |\mathcal{L} \wedge a \cdot b \pmod N\rangle_\gamma |0\rangle_1 |0\rangle_\delta .
\end{aligned} \tag{5.29} $$

If enabled, MULN computes the product mod $N$ of the c-number $a$ and the q-number $b$;
otherwise, it acts trivially.

We could construct MULN by chaining together $K$ OADDN operators. The first
OADDN loads $a \cdot b_0$, the second adds $a \cdot 2 b_1$, the third adds $a \cdot 2^2 b_2$, and so on. But we can
actually save a few elementary operations by simplifying the first operation in the chain.
For this purpose we introduce an elementary multiplication operator EMUL that multiplies
a c-number $a$ by a single qubit $b_0$:

$$ \mathrm{EMUL}(a)_{[\mathcal{L}],1,\gamma} : |b_0\rangle_1 |0\rangle_\gamma \longmapsto |b_0\rangle_1 |\mathcal{L} \wedge a \cdot b_0\rangle_\gamma , \tag{5.30} $$

which is constructed according to

$$ \mathrm{EMUL}(a)_{[\mathcal{L}],1,\gamma} = \prod_{i=0}^{K-1} \text{if } (a_i = 1) \ \ C_{[\mathcal{L},1],\gamma_i} . \tag{5.31} $$

Now we can construct MULN as

$$ \mathrm{MULN}(a,N)_{[\mathcal{L}],\beta,\gamma,1,\delta} = \prod_{i=1}^{K-1} \mathrm{OADDN}(2^i \cdot a \pmod N, N)_{[\mathcal{L},\beta_i],\gamma,1,\delta} \cdot \mathrm{EMUL}(a)_{[\mathcal{L}],\beta_0,\gamma} \tag{5.32} $$

(see Fig. 11). Note that the computation of $2^i \cdot a \pmod N$ is carried out by the classical
computer. (It can be done efficiently by "repeated doubling.")

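As long as $a$ and $N$ have no common divisor ($\gcd(a, N) = 1$), the operation of multiplying
by $a \pmod N$ is invertible. In fact, the multiplicative inverse $a^{-1} \pmod N$ exists, and
$\mathrm{MULN}(a)$ is inverted by $\mathrm{MULN}(a^{-1})$. Thus, we can use the trick discussed in Sec. IV B
to construct an overwriting version of the multiplication operator. This operator, denoted
OMULN, acts according to

$$ \begin{aligned}
\mathrm{OMULN}(a,N)_{[\mathcal{L}],\beta,\gamma,1,\delta} :{}& |b\rangle_\beta |0\rangle_\gamma |0\rangle_1 |0\rangle_\delta \\
& \longmapsto |\mathcal{L} \wedge a \cdot b \pmod N \ \vee \ \neg\mathcal{L} \wedge b\rangle_\beta |0\rangle_\gamma |0\rangle_1 |0\rangle_\delta .
\end{aligned} \tag{5.33} $$

Note that OMULN acts trivially when not enabled. It can be constructed as

$$ \mathrm{OMULN}(a,N)_{[\mathcal{L}],\beta,\gamma,1,\delta} = \mathrm{XOR}_{[\mathcal{L}],\beta,\gamma} \cdot \mathrm{XOR}_{[\mathcal{L}],\gamma,\beta} \cdot \mathrm{MULN}^{-1}(a^{-1},N)_{[\mathcal{L}],\gamma,\beta,1,\delta} \cdot \mathrm{MULN}(a,N)_{[\mathcal{L}],\beta,\gamma,1,\delta} \tag{5.34} $$

(see Fig. 12). Here, the (conditional) XOR operation is

$$ \mathrm{XOR}_{[\mathcal{L}],\alpha,\beta} = \prod_{i=0}^{K-1} C_{[\mathcal{L},\alpha_i],\beta_i} : |a\rangle_\alpha |b\rangle_\beta \longmapsto |a\rangle_\alpha |b \oplus (a \wedge \mathcal{L})\rangle_\beta , \tag{5.35} $$

where $\oplus$ denotes bitwise addition modulo 2. It is easy to verify that, when enabled, OMULN
acts as specified in Eq. (5.33); the two XOR's at the end are needed to swap $|0\rangle_\beta$ and
$|a \cdot b \pmod N\rangle_\gamma$. To verify Eq. (5.33) when OMULN is not enabled, we need to know that
MULN, when not enabled, acts according to

$$ \begin{aligned}
\mathrm{MULN}(a,N)_{[\mathcal{L} \ne 1],\beta,\gamma,1,\delta} :{}& |0\rangle_\beta |b\rangle_\gamma |0\rangle_1 |0\rangle_\delta \\
& \longmapsto |0\rangle_\beta |b\rangle_\gamma |0\rangle_1 |0\rangle_\delta .
\end{aligned} \tag{5.36} $$

Though Eq. (5.36) does not follow directly from the defining action of MULN specified in
Eq. (5.29), it can be seen to be a consequence of Eqs. (5.32) and (5.28). Note that the computation
of $a^{-1}$ is performed by the classical computer. (This is, in fact, the most computationally
intensive task that our classical computer will need to perform.)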
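The logic of the overwriting trick can be traced with a classical model of the two registers. The sketch below is illustrative only: the names are hypothetical, integers stand in for the q-number registers, the enable string is a single Boolean, and the modular inverse is obtained with Python's built-in `pow(a, -1, N)` (available from Python 3.8).

```python
from math import gcd

def omuln(a: int, b: int, N: int, enable: int = 1):
    """Classical model of overwriting multiplication; returns (beta, gamma)."""
    assert gcd(a, N) == 1 and 0 <= b < N
    a_inv = pow(a, -1, N)                 # computed once, by the classical computer
    beta, gamma = b, 0
    if enable:
        gamma = (a * beta) % N            # multiplication by a writes a*b into gamma
        # Multiplying gamma by a^{-1} would recompute b, so the inverse of that
        # multiplication erases the copy of b held in beta.
        assert (a_inv * gamma) % N == beta
        beta = 0
        beta ^= gamma                     # first XOR: copy gamma into beta
        gamma ^= beta                     # second XOR: clear gamma
    return beta, gamma
```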

We will require the OMULN operator with an enable string $\mathcal{L}$ that is only a single
qubit. Thus the construction that we have described can be implemented on our enhanced
machine. So constructed, the OMULN operator uses (and then clears) $2K + 1$ qubits of
scratch space. This amount is all of the scratch space that will be required to compute the
modular exponential function.

If we wish to construct OMULN on the basic machine (using controlled$^k$-NOT's with $k =
0, 1, 2$), there are several alternatives. One alternative (that requiring the fewest elementary
gates) is to use two additional qubits of scratch space ($2K + 3$ scratch qubits altogether).
Then, when MULN calls for OADDN with two enable bits, we use one of the scratch
qubits to store the logical AND of the two enable bits. Now OADDN with one enable bit
can be called instead, where the scratch bit is the enable bit. (See Fig. 13.) When OADDN
eventually calls for MUXFA with a single enable bit, we can use the second extra scratch
qubit to construct MUXFA'' as in Eq. (5.13) and Fig. 5. Of course, another alternative is
to use the Barenco et al. identity Eq. (5.14) repeatedly to expand all the controlled$^3$-NOT
and controlled$^4$-NOT gates in terms of controlled$^k$-NOT gates with $k = 0, 1, 2$. Then we can
get by with $2K + 1$ bits of scratch space, but at the cost of sharply increasing the number
of elementary gates.

F. Modular exponentiation



The operator EXPN that computes the modular exponential function can now be
constructed from the conditional overwriting multiplication operator, as outlined in Sec.
IV E. Its action is:

$$ \begin{aligned}
\mathrm{EXPN}(x,N)_{\alpha,\beta,\gamma,1,\delta} :{}& |a\rangle^*_\alpha |0\rangle_\beta |0\rangle_\gamma |0\rangle_1 |0\rangle_\delta \\
& \longmapsto |a\rangle^*_\alpha |x^a \pmod N\rangle_\beta |0\rangle_\gamma |0\rangle_1 |0\rangle_\delta .
\end{aligned} \tag{5.37} $$

(Recall that $|\cdot\rangle^*$ denotes a register that is $L$ qubits long; $N$ and $x$ are $K$-bit c-numbers.) It
is constructed as

$$ \mathrm{EXPN}(x,N)_{\alpha,\beta,\gamma,1,\delta} = \left( \prod_{i=0}^{L-1} \mathrm{OMULN}(x^{2^i} \pmod N, N)_{[\alpha_i],\beta,\gamma,1,\delta} \right) C_{\beta_0} \tag{5.38} $$

(Fig. 14). Note that the $C_{\beta_0}$ is necessary at the beginning to set the register $|\cdot\rangle_\beta$ to 1 (not
0). The classical computer must calculate each $x^{2^i}$ and each inverse $x^{-2^i}$. The computation
of $x^{-1} \pmod N$ can be performed using Euclid's algorithm in $O(K^3)$ elementary bit operations
using "grade school" multiplication, or more efficiently using fast multiplication tricks.
Fortunately, only one inverse need be computed: the $x^{-2^i}$'s, like the $x^{2^i}$'s, are calculated by
repeated squaring.
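The classical precomputation described here can be sketched as follows. The function name is hypothetical, and the single inverse is obtained with Python's built-in `pow(x, -1, N)` (Python 3.8 or later) rather than an explicit implementation of Euclid's algorithm; both tables are then filled in by repeated squaring.

```python
def power_tables(x: int, N: int, L: int):
    """Return ([x^(2^i) mod N], [x^(-2^i) mod N]) for i = 0, ..., L-1."""
    x_inv = pow(x, -1, N)                 # the only modular inverse that is needed
    powers, inverses = [x % N], [x_inv]
    for _ in range(1, L):
        powers.append((powers[-1] * powers[-1]) % N)
        inverses.append((inverses[-1] * inverses[-1]) % N)
    return powers, inverses
```

Each pair of entries multiplies to 1 mod $N$, so the $i$th inverse is what the $i$th overwriting multiplication needs in order to clear its scratch register.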

Actually, it is possible to reduce the number of quantum gates somewhat if the NOT
and the first OMULN in Eq. (5.38) are replaced by the simpler operation

$$ (C_{\alpha_0}\, C_{[\alpha_0],\beta_0}\, C_{\alpha_0}) \cdot \mathrm{EMUL}(x)_{\alpha_0,\beta} . \tag{5.39} $$

It is easy to verify that this operator has the same action on the state $|a_0\rangle_{\alpha_0} |0\rangle_\beta$ as
$\mathrm{OMULN}(x,N)_{[\alpha_0],\beta,\gamma,1,\delta} \cdot C_{\beta_0}$. With this substitution, we have defined the EXPN
operation whose complexity will be analyzed in the following section.



VI. SPACE VERSUS TIME 



Now that we have spelled out the algorithms in detail, we can count the number of 
elementary quantum gates that they use. 



A. Enhanced machine 



We will use the notation

$$ [\mathrm{OPERATOR}] = [c_0, c_1, c_2, c_3, c_4] \tag{6.1} $$

to indicate that OPERATOR is implemented using $c_0$ NOT gates, $c_1$ controlled-NOT gates,
$c_2$ controlled$^2$-NOT gates, $c_3$ controlled$^3$-NOT gates, and $c_4$ controlled$^4$-NOT gates on the
enhanced machine, or

$$ [\mathrm{OPERATOR}] = [c_0, c_1, c_2] \tag{6.2} $$

to indicate that OPERATOR is implemented using $c_0$ NOT gates, $c_1$ controlled-NOT gates,
and $c_2$ controlled$^2$-NOT gates on the basic machine. By inspecting the network constructed
in Sec. V, we see that the following identities hold:

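$$\begin{aligned}
[\mathrm{EXPN}] &= (L-1)\cdot[\mathrm{OMULN}_{[1]}] + [\mathrm{EMUL}] + [\text{controlled-NOT}] + 2\cdot[\mathrm{NOT}] ; \\
[\mathrm{OMULN}_{[1]}] &= 2\cdot[\mathrm{MULN}_{[1]}] + 2\cdot[\mathrm{XOR}_{[1]}] ; \\
[\mathrm{MULN}_{[1]}] &= (K-1)\cdot[\mathrm{OADDN}_{[2]}] + [\mathrm{EMUL}_{[1]}] ; \\
[\mathrm{OADDN}_{[2]}] &= 2\cdot[\mathrm{ADDN}_{[2]}] + [\text{controlled}^2\text{-NOT}] ; \\
[\mathrm{ADDN}_{[2]}] &= [\mathrm{MADD}_{[2]}] + [\mathrm{XLT}_{[2]}] ; \\
[\mathrm{MADD}_{[2]}] &= (K-1)\cdot[\mathrm{MUXFA}_{[2]}] + [\mathrm{MUXHA}_{[2]}] ; \\
[\mathrm{XLT}_{[2]}] &= 2\cdot[\mathrm{LT}] + [\text{controlled}^3\text{-NOT}] .
\end{aligned} \tag{6.3}$$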



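These equations just say that OMULN$_{[1]}$, say, is constructed from 2 MULN$_{[1]}$'s and 2 XOR$_{[1]}$'s, and so forth. The subscript $[\cdot]$ indicates the length of the string of enable bits for each operator. By combining these equations, we find the following expression for the total number of elementary gates called by our EXPN routine:

$$\begin{aligned}
[\mathrm{EXPN}] = (L-1)\cdot\Big\{\ & 4(K-1)^2\cdot[\mathrm{MUXFA}_{[2]}] + 4(K-1)\cdot[\mathrm{MUXHA}_{[2]}] \\
& + 8(K-1)\cdot[\mathrm{LT}] + 4(K-1)\cdot[\text{controlled}^3\text{-NOT}] \\
& + 2(K-1)\cdot[\text{controlled}^2\text{-NOT}] + 2\cdot[\mathrm{EMUL}_{[1]}] + 2\cdot[\mathrm{XOR}_{[1]}]\ \Big\} \\
& + [\mathrm{EMUL}] + [\text{controlled-NOT}] + 2\cdot[\mathrm{NOT}] .
\end{aligned} \tag{6.4}$$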



By plugging in the number of elementary gates used by MUXFA, MUXHA, LT, EMUL, and XOR, we can find the number of controlled$^k$-NOT gates used in the EXPN network.

For large K, the leading term in our expression for the number of gates is of order $LK^2$. Only the MUXFA and LT operators contribute to this leading term; the other operators make a subleading contribution. Thus

$$[\mathrm{EXPN}] = \left( 4LK^2\cdot[\mathrm{MUXFA}_{[2]}] + 8LK\cdot[\mathrm{LT}] \right) \left( 1 + O(1/K) \right) . \tag{6.5}$$



We will now discuss how this leading term varies as we change the amount of available 
scratch space, or replace the enhanced machine by the basic machine. 

The numbers of elementary gates used by MUXFA and by LT actually depend on the particular values of the classical bits in the binary expansions of $2^j x^{\pm 2^i}$ (mod N) and $2^K - N + 2^j x^{\pm 2^i}$ (mod N), where $j = 1, \ldots, K-1$ and $i = 0, 1, \ldots, L-1$. We will estimate the number of gates in two different ways. To count the gates in the "worst case," we always assume that the classical bits take values that maximize the number of gates. To count in the "average case," we make the much more reasonable assumption that the classical bits take the value 0 with probability 1/2 and the value 1 with probability 1/2.
For example, in the case of the implementation of MUXFA$_{[2]}$ on the enhanced machine described in Eq. (5.12), counting the operations yields

$$\begin{aligned}
[\mathrm{MUXFA}(0,0)_{[2]}] &= [0,1,1,0,0] , \\
[\mathrm{MUXFA}(1,1)_{[2]}] &= [0,1,2,1,0] , \\
[\mathrm{MUXFA}(0,1)_{[2]}] &= [0,1,1,1,1] , \\
[\mathrm{MUXFA}(1,0)_{[2]}] &= [2,1,1,1,1] ,
\end{aligned} \tag{6.6}$$

and thus

$$\begin{aligned}
[\mathrm{MUXFA}_{[2]}]^{\rm worst} &= [2,1,2,1,1] , \\
[\mathrm{MUXFA}_{[2]}]^{\rm ave} &= \left[ \tfrac{1}{2}, 1, \tfrac{5}{4}, \tfrac{3}{4}, \tfrac{1}{2} \right] .
\end{aligned} \tag{6.7}$$

That is, the worst case is the maximum in each column, and the average case is the mean of each column. When we quote the number of gates without any qualification, the average case is meant. Similarly, for the LT operation described in Eq. (5.23), we have

$$\begin{aligned}
[\mathrm{LT}]^{\rm worst} &= [K, 2, 2K-3, 0, 0] , \\
[\mathrm{LT}]^{\rm ave} &= \left[ K - \tfrac{1}{2}, \tfrac{3}{2}, \tfrac{3}{2}K - \tfrac{5}{2}, 0, 0 \right] .
\end{aligned} \tag{6.8}$$

Note that LT uses no controlled$^3$-NOT or controlled$^4$-NOT gates, and so can be implemented as above on the basic machine.
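The "maximum of each column" and "mean of each column" rules can be checked mechanically. The sketch below takes the four MUXFA$_{[2]}$ gate vectors as we read them from Eq. (6.6); the dictionary and function names are illustrative only.

```python
from fractions import Fraction

# Gate vectors [NOT, C-NOT, C^2-NOT, C^3-NOT, C^4-NOT] for MUXFA_[2](a0, a1),
# as read from Eq. (6.6).
MUXFA2_CASES = {
    (0, 0): [0, 1, 1, 0, 0],
    (1, 1): [0, 1, 2, 1, 0],
    (0, 1): [0, 1, 1, 1, 1],
    (1, 0): [2, 1, 1, 1, 1],
}


def worst_case(cases):
    """Column-wise maximum over the classical-bit cases."""
    return [max(column) for column in zip(*cases.values())]


def average_case(cases):
    """Column-wise mean, assuming every case is equally likely."""
    return [Fraction(sum(column), len(column)) for column in zip(*cases.values())]


worst = worst_case(MUXFA2_CASES)
average = average_case(MUXFA2_CASES)
print(worst, [str(value) for value in average])
```

Note that the worst-case vector need not correspond to any single choice of (a0, a1); it is an upper bound taken gate type by gate type.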



Now, from Eq. (6.5), we find the leading behavior of the number of gates used by the EXPN routine:

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst}_{{\rm enhanced},2K+1} &= LK^2 \cdot [16,4,24,4,4] \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave}_{{\rm enhanced},2K+1} &= LK^2 \cdot [10,4,17,3,2] \cdot \left( 1 + O(1/K) \right) ,
\end{aligned} \tag{6.9}$$

where the subscript "enhanced, 2K+1" serves to remind us that this count applies to the enhanced machine with 2K + 1 qubits of scratch space. A convenient (though quite crude) "one-dimensional" measure of the complexity of the algorithm is the total number of laser pulses required to implement the algorithm on a linear ion trap, following the scheme of Cirac and Zoller. Assuming 1 pulse for a NOT and 2k + 3 pulses for a controlled$^k$-NOT, k = 1, 2, 3, 4, we obtain

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst\ pulses}_{{\rm enhanced},2K+1} &= 256\,LK^2 \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm enhanced},2K+1} &= 198\,LK^2 \cdot \left( 1 + O(1/K) \right) .
\end{aligned} \tag{6.10}$$



(The estimate for the worst case is not obtained directly from Eq. (6.9); instead we assume that MUXFA is always called with the argument ($a_0 = 1$, $a_1 = 0$). This maximizes the number of pulses required, though it does not maximize the number of controlled$^2$-NOT gates.) Including the subleading contributions, the count of gates and pulses used by our network in the average case is

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm ave}_{{\rm enhanced},2K+1} &= (L-1)\cdot\left[ 10K^2 - 14K + 4,\ 4K^2 + 8K - 12,\ 17K^2 - 36K + 22,\ 3K^2 - 3,\ 2K^2 - 4K + 2 \right] + \left[ 2,\ \tfrac{1}{2}K + 1,\ 0,\ 0,\ 0 \right] , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm enhanced},2K+1} &= (L-1)\cdot\left( 198K^2 - 270K + 93 \right) + \tfrac{5}{2}K + 7 .
\end{aligned} \tag{6.11}$$
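As a consistency check, the pulse polynomial can be recomputed from the gate polynomials by weighting each gate type with its pulse cost (1 for a NOT, 2k + 3 for a controlled$^k$-NOT). The sketch below is illustrative; the names are ours. Weighting the five gate polynomials of the enhanced machine with 2K + 1 scratch qubits gives $198K^2 - 270K + 93$ per factor of (L − 1), the value that reproduces the 15,284-pulse estimate quoted in Sec. VII for K = 4, L = 8.

```python
from fractions import Fraction

# Pulses per gate: NOT, then 2k + 3 for controlled^k-NOT with k = 1..4.
PULSES_PER_GATE = [1, 5, 7, 9, 11]

# (K^2, K, constant) coefficients of each gate count per factor of (L - 1),
# enhanced machine with 2K + 1 scratch qubits.
GATES_PER_FACTOR = [
    (10, -14, 4),    # NOT
    (4, 8, -12),     # controlled-NOT
    (17, -36, 22),   # controlled^2-NOT
    (3, 0, -3),      # controlled^3-NOT
    (2, -4, 2),      # controlled^4-NOT
]


def pulse_polynomial(gate_polys):
    """Weight each gate polynomial by its pulse cost and add them up."""
    return tuple(
        sum(weight * poly[i] for weight, poly in zip(PULSES_PER_GATE, gate_polys))
        for i in range(3)
    )


def expn_pulses(K, L):
    """Average-case pulse count, including the subleading terms."""
    a, b, c = pulse_polynomial(GATES_PER_FACTOR)
    tail = Fraction(5 * K, 2) + 7  # 2 NOTs plus (K/2 + 1) controlled-NOTs
    return (L - 1) * (a * K * K + b * K + c) + tail


print(pulse_polynomial(GATES_PER_FACTOR), expn_pulses(4, 8))
```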

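By allowing one extra qubit of scratch space, we can reduce the complexity (measured in laser pulses) somewhat. When MULN$_{[1]}$ calls for OADDN$_{[2]}$, we may use a controlled$^2$-NOT to store the AND of the two enable bits in the extra scratch qubit, and then call OADDN$_{[1]}$ instead, with the scratch bit as the enable bit. The extra controlled$^2$-NOT's that compute and clear the AND bit do not affect the leading behavior of the count of elementary gates. The only effect on the leading behavior is that MUXFA$_{[2]}$ can be replaced by MUXFA$_{[1]}$, for which

$$\begin{aligned}
[\mathrm{MUXFA}_{[1]}]^{\rm worst} &= [2,2,2,1,0] , \\
[\mathrm{MUXFA}_{[1]}]^{\rm ave} &= \left[ \tfrac{1}{2}, \tfrac{5}{4}, \tfrac{7}{4}, \tfrac{1}{2}, 0 \right] .
\end{aligned} \tag{6.12}$$

Hence we find

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst}_{{\rm enhanced},2K+2} &= LK^2 \cdot [16,8,24,4,0] \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave}_{{\rm enhanced},2K+2} &= LK^2 \cdot [10,5,19,2,0] \cdot \left( 1 + O(1/K) \right) ,
\end{aligned} \tag{6.13}$$

and

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst\ pulses}_{{\rm enhanced},2K+2} &= 240\,LK^2 \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm enhanced},2K+2} &= 186\,LK^2 \cdot \left( 1 + O(1/K) \right) .
\end{aligned} \tag{6.14}$$

The precise count in the average case is

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm ave}_{{\rm enhanced},2K+2} &= (L-1)\cdot\left[ 10K^2 - 14K + 4,\ 5K^2 + 10K - 14,\ 19K^2 - 34K + 21,\ 2K^2 - 4K + 2,\ 0 \right] + \left[ 2,\ \tfrac{1}{2}K + 1,\ 0,\ 0,\ 0 \right] , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm enhanced},2K+2} &= (L-1)\cdot\left( 186K^2 - 238K + 99 \right) + \tfrac{5}{2}K + 7 .
\end{aligned} \tag{6.15}$$

Note that, in this version of the algorithm, no controlled$^4$-NOT gates are needed.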



B. Basic machine 



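Now we consider the basic machine, first with 2K + 3 bits of scratch space. We use one of our extra scratch bits to combine the enable bits for OADDN as explained above. The other extra bit is used to replace MUXFA$_{[1]}$ by the version MUXFA$''_{[1]}$ given in Eq. (5.13); MUXFA$''_{[1]}$ uses only the gates available on the basic machine. The new count is

$$\begin{aligned}
[\mathrm{MUXFA}''_{[1]}]^{\rm worst} &= [2,2,4] , \\
[\mathrm{MUXFA}''_{[1]}]^{\rm ave} &= \left[ \tfrac{1}{2}, \tfrac{7}{4}, \tfrac{11}{4} \right] .
\end{aligned} \tag{6.16}$$

The LT operation need not be modified, as it requires no controlled$^3$-NOT or controlled$^4$-NOT gates. We therefore find

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst}_{{\rm basic},2K+3} &= LK^2 \cdot [16,8,32] \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave}_{{\rm basic},2K+3} &= LK^2 \cdot [10,7,23] \cdot \left( 1 + O(1/K) \right) ,
\end{aligned} \tag{6.17}$$

and

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst\ pulses}_{{\rm basic},2K+3} &= 280\,LK^2 \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm basic},2K+3} &= 206\,LK^2 \cdot \left( 1 + O(1/K) \right) .
\end{aligned} \tag{6.18}$$

With the subleading corrections we have in the average case

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm ave}_{{\rm basic},2K+3} &= (L-1)\cdot\left[ 10K^2 - 14K + 4,\ 7K^2 + 6K - 12,\ 23K^2 - 42K + 25 \right] + \left[ 2,\ \tfrac{1}{2}K + 1,\ 0 \right] , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm basic},2K+3} &= (L-1)\cdot\left( 206K^2 - 278K + 119 \right) + \tfrac{5}{2}K + 7 .
\end{aligned} \tag{6.19}$$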



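We can squeeze the scratch space down to 2K + 2 bits if we replace MUXFA$''_{[1]}$ by MUXFA$'''_{[1]}$ given in Eq. (5.15), which does not require an extra scratch bit. The gate count becomes

$$\begin{aligned}
[\mathrm{MUXFA}'''_{[1]}]^{\rm worst} &= [2,2,6] , \\
[\mathrm{MUXFA}'''_{[1]}]^{\rm ave} &= \left[ \tfrac{1}{2}, \tfrac{5}{4}, \tfrac{15}{4} \right] ,
\end{aligned} \tag{6.20}$$

so that we now have

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst}_{{\rm basic},2K+2} &= LK^2 \cdot [16,8,40] \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave}_{{\rm basic},2K+2} &= LK^2 \cdot [10,5,27] \cdot \left( 1 + O(1/K) \right) ,
\end{aligned} \tag{6.21}$$

and

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst\ pulses}_{{\rm basic},2K+2} &= 316\,LK^2 \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm basic},2K+2} &= 224\,LK^2 \cdot \left( 1 + O(1/K) \right) .
\end{aligned} \tag{6.22}$$

The precise count of gates and pulses in the average case is

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm ave}_{{\rm basic},2K+2} &= (L-1)\cdot\left[ 10K^2 - 14K + 4,\ 5K^2 + 10K - 14,\ 27K^2 - 50K + 29 \right] + \left[ 2,\ \tfrac{1}{2}K + 1,\ 0 \right] , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm basic},2K+2} &= (L-1)\cdot\left( 224K^2 - 314K + 137 \right) + \tfrac{5}{2}K + 7 .
\end{aligned} \tag{6.23}$$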



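To squeeze the scratch space by yet another bit, we must abandon the extra bit used by MULN$'$. We then construct MUXFA$''''_{[2]}$ by expanding the controlled$^3$-NOT and controlled$^4$-NOT gates in terms of controlled$^2$-NOT gates, as discussed in Sec. V B. We find that

$$\begin{aligned}
[\mathrm{MUXFA}''''_{[2]}]^{\rm worst} &= [2,1,15] , \\
[\mathrm{MUXFA}''''_{[2]}]^{\rm ave} &= \left[ \tfrac{1}{2}, 1, \tfrac{37}{4} \right] ;
\end{aligned} \tag{6.24}$$

therefore,

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst}_{{\rm basic},2K+1} &= LK^2 \cdot [16,4,76] \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave}_{{\rm basic},2K+1} &= LK^2 \cdot [10,4,49] \cdot \left( 1 + O(1/K) \right) ,
\end{aligned} \tag{6.25}$$

and

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm worst\ pulses}_{{\rm basic},2K+1} &= 568\,LK^2 \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm basic},2K+1} &= 373\,LK^2 \cdot \left( 1 + O(1/K) \right) .
\end{aligned} \tag{6.26}$$

Including the subleading corrections, the count in the average case is

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm ave}_{{\rm basic},2K+1} &= (L-1)\cdot\left[ 10K^2 - 14K + 4,\ 4K^2 + 8K - 12,\ 49K^2 - 76K + 30 \right] + \left[ 2,\ \tfrac{1}{2}K + 1,\ 0 \right] , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm basic},2K+1} &= (L-1)\cdot\left( 373K^2 - 506K + 154 \right) + \tfrac{5}{2}K + 7 .
\end{aligned} \tag{6.27}$$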



Our results for the average number of gates and pulses are summarized in the following table:

              basic                      enhanced
  scratch     gates         pulses       gates             pulses
  2K+1        [10,4,49]     373          [10,4,17,3,2]     198
  2K+2        [10,5,27]     224          [10,5,19,2,0]     186
  2K+3        [10,7,23]     206
                                                                    (6.28)

Each entry in the table is the coefficient of $LK^2$ (the leading term) in the number of gates or pulses, where the notation for the number of gates is that defined in Eqs. (6.1) and (6.2). Of course, the numbers just represent our best effort to construct an efficient network. Perhaps a more clever designer could do better.
C. Unlimited Space 



The gate counts summarized in Eq. (6.28) provide a "case study" of the tradeoff between the amount of scratch space and the speed of computation. But all of the algorithms described above are quite parsimonious with scratch space. We will now consider how increasing the amount of scratch space considerably allows us to speed things up further.

First of all, recall that our OADDN routine calls the comparison operator LT four 
times, twice running forwards and twice running in reverse. The point was that we wanted 
to clear the scratch space used by LT before MADD acted, so that space could be reused 
by MADD. But if we were to increase the scratch space by K — 1 bits, it would not 
be necessary for LT to run backwards before MADD acts. Instead, a modified OADDN 
routine could clear the scratch space used by LT and by MADD, running each subroutine 
only twice (once forward and once backward). 

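Thus, with adequate space, we can replace Eq. (6.5) with

$$[\mathrm{EXPN}] = \left( 4LK^2\cdot[\mathrm{MUXFA}_{[1]}] + 4LK\cdot[\mathrm{LT}] \right) \left( 1 + O(1/K) \right) . \tag{6.29}$$

Using this observation, we can modify our old network on the enhanced machine (with 2K + 2 bits of scratch) to obtain

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm ave}_{{\rm enhanced},3K+1} &= LK^2 \cdot [6,5,13,2,0] \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm enhanced},3K+1} &= 140\,LK^2 \cdot \left( 1 + O(1/K) \right) ,
\end{aligned} \tag{6.30}$$

about 25% faster.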

To do substantially better requires much more space. Optimized for speed, our algorithms 
will never clear the scratch space at intermediate stages of the computation. Instead, EXPN 
will carry out of order LK additions, filling new space each time a comparison is performed 
or a sum is computed. Once the computation of x a (mod N) is complete, we copy the result 
and then run the computation backwards to clear all the scratch space. But with altogether 
~ LK ADDN's, each involving a comparison and a sum, we fill about 2LK 2 qubits of 
scratch space. Combining the cost of running the gates forward and backward, we have 

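$$[\mathrm{EXPN}] \cdot [\mathrm{EXPN}^{-1}] = \left( 2LK^2\cdot[\mathrm{MUXFA}_{[1]}] + 2LK\cdot[\mathrm{LT}] \right) \left( 1 + O(1/K) \right) , \tag{6.31}$$

and therefore

$$\begin{aligned}
[\mathrm{EXPN}]^{\rm ave}_{{\rm enhanced},\sim 2LK^2} &= LK^2 \cdot \left[ 3, \tfrac{5}{2}, \tfrac{13}{2}, 1, 0 \right] \cdot \left( 1 + O(1/K) \right) , \\
[\mathrm{EXPN}]^{\rm ave\ pulses}_{{\rm enhanced},\sim 2LK^2} &= 70\,LK^2 \cdot \left( 1 + O(1/K) \right) ,
\end{aligned} \tag{6.32}$$

another factor of 2 improvement in speed.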
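The leading coefficients for the space-rich variants follow from the same bookkeeping: each variant calls MUXFA$_{[1]}$ and LT a fixed number of times per unit of $LK^2$ and $LK$. The sketch below assumes the average-case vector for MUXFA$_{[1]}$ as we read it from Eq. (6.12) and the leading (order K) part of the average LT count from Eq. (6.8); the variant labels and function names are ours.

```python
from fractions import Fraction as F

MUXFA1_AVE = [F(1, 2), F(5, 4), F(7, 4), F(1, 2), F(0)]   # Eq. (6.12), average case
LT_AVE_LEADING = [F(1), F(0), F(3, 2), F(0), F(0)]        # coefficient of K in [LT]


def leading_gates(n_muxfa, n_lt):
    """Coefficient of L*K^2 for n_muxfa*L*K^2 MUXFA calls and n_lt*L*K LT calls."""
    return [n_muxfa * m + n_lt * t for m, t in zip(MUXFA1_AVE, LT_AVE_LEADING)]


def pulses(gates):
    """1 pulse per NOT, 2k + 3 pulses per controlled^k-NOT."""
    return sum(g * (1 if k == 0 else 2 * k + 3) for k, g in enumerate(gates))


# (MUXFA multiplier, LT multiplier) for each amount of scratch space.
VARIANTS = {"2K+2": (4, 8), "3K+1": (4, 4), "~2LK^2": (2, 2)}

for label, (n_muxfa, n_lt) in VARIANTS.items():
    gates = leading_gates(n_muxfa, n_lt)
    print(label, [str(g) for g in gates], pulses(gates))
```

The three pulse coefficients come out as 186, 140 and 70, matching the "about 25% faster" and "factor of 2" statements in the text.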

For asymptotically large K, further improvements are possible, for we can invoke classical algorithms that multiply K-bit numbers in time less than $O(K^2)$. The fastest known, the Schönhage-Strassen algorithm, requires $O(K \log K \log\log K)$ elementary operations [25]. It thus should be possible to perform modular exponentiation on a quantum computer in a time of order $LK \log K \log\log K$. We have not worked out the corresponding networks in detail, or determined the precise scratch space requirements for such an algorithm.
D. Minimal Space 



Now consider the other extreme, where we disregard speed, and optimize our algorithms to minimize space.

Since addition is an invertible operation, it is possible to construct a unitary "overwriting addition" operator that adds a c-number to a q-number and replaces the q-number addend with the sum. But the construction of our OADDN operator involved two stages: first we performed the addition without overwriting the input, and then ran the addition routine backwards to erase the input. Thus, our overwriting OADDN routine for adding a K-bit c-number to a K-bit q-number (mod N) required K + 1 bits of scratch space.

There is no reason in principle why this scratch space should be necessary (though eliminating it may slow down the computation). In fact, we will show that it is possible to add without using any scratch space at all. Of course, we will still need a comparison bit to perform mod N addition. And there is no obvious way to eliminate the need for a K-bit scratch register that stores partial sums when we multiply. Still, using overwriting addition, we can construct an EXPN operator that requires just K + 1 bits of scratch space (compared to 2K + 1 in our best previous effort). The price we pay is that the computation slows down considerably.

The key to adding without scratch space is to work from left to right instead of right to left. It is sufficient to see how to add a single-bit c-number $a_0$ to a K-bit q-number b, obtaining a (K + 1)-bit q-number. Of course, if the classical bit is 0, we do nothing. If the classical bit is 1, we perform addition by executing the pseudo-code:

    if b_{K-1} = b_{K-2} = ... = b_1 = b_0 = 1 : flip b_K
    if b_{K-2} = b_{K-3} = ... = b_1 = b_0 = 1 : flip b_{K-1}
    ...
    if b_1 = b_0 = 1 : flip b_2
    if b_0 = 1 : flip b_1
    flip b_0                                                      (6.33)

Thus, the operator

$$\mathrm{ADD}(a_0)_{\beta_K,\beta} \equiv \text{if } (a_0 = 1):\quad C_{\beta_0}\, C_{[\beta_0],\beta_1} \cdots C_{[\beta_0,\beta_1,\ldots,\beta_{K-2}],\beta_{K-1}}\, C_{[\beta_0,\beta_1,\ldots,\beta_{K-1}],\beta_K} \tag{6.34}$$

has the action

$$\mathrm{ADD}(a_0)_{\beta_K,\beta} : |0\rangle_{\beta_K} |b\rangle_\beta \longmapsto |(b + a_0)_K\rangle_{\beta_K} |b + a_0\rangle_\beta . \tag{6.35}$$

It fills the K + 1 qubits $|\cdot\rangle_{\beta_K} |\cdot\rangle_\beta$ with the (K + 1)-bit sum $b + a_0$. To add a K-bit c-number a to the K-bit q-number b, we apply this procedure iteratively. After adding $a_0$ to b, we add $a_1$ to the (K − 1)-qubit number $b_{K-1} b_{K-2} \cdots b_2 b_1$, then add $a_2$ to the (K − 2)-qubit number $b_{K-1} b_{K-2} \cdots b_3 b_2$, and so on. Thus, the computation of b + a requires in the worst case (a = 111...11) a total number of operations

$$[\mathrm{ADD}(a)] = [K, K, K-1, K-2, \ldots, 2, 1] ; \tag{6.36}$$

that is, K NOT's, K controlled-NOT's, K − 1 controlled$^2$-NOT's, ..., 2 controlled$^{K-1}$-NOT's, and 1 controlled$^K$-NOT. In the average case (where half the bits of a are zero), only half of these gates need to be executed. For the Cirac-Zoller device, figuring 2k + 3 laser pulses for a controlled$^k$-NOT with k ≥ 1, and one pulse for a NOT, this translates into $\tfrac{1}{6}K(2K^2 + 15K + 19)$ laser pulses for each K-bit addition, in the worst case, or in the average case

$$[\mathrm{ADD}]^{\rm ave\ pulses}_{\rm no\ scratch} = \tfrac{1}{6}K^3 + \tfrac{5}{4}K^2 + \tfrac{19}{12}K . \tag{6.37}$$

We can easily promote this operation to a conditional ADD with $\ell$ enable bits by simply adding the enable qubits to the control string of each gate; the complexity then becomes

$$[\mathrm{ADD}_{[\ell]}]^{\rm ave\ pulses}_{\rm no\ scratch} = \tfrac{1}{6}K^3 + \left( \tfrac{1}{2}\ell + \tfrac{5}{4} \right) K^2 + \left( \tfrac{3}{2}\ell + \tfrac{31}{12} \right) K , \qquad \ell \ge 1 . \tag{6.38}$$
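The left-to-right addition procedure can be simulated classically on computational basis states, which is enough to check both that it adds correctly and that the worst-case gate count is [K, K, K − 1, ..., 2, 1]. This is a minimal sketch under that assumption (no superpositions are tracked); the function names are ours.

```python
def pulses(gate_counts):
    """Laser pulses: 1 per NOT, 2k + 3 per controlled^k-NOT (k >= 1)."""
    return sum(c * (1 if k == 0 else 2 * k + 3) for k, c in enumerate(gate_counts))


def add_in_place(b, a, K):
    """Add the K-bit c-number a to the K-bit number b held in K + 1 bits.

    Returns (sum, gate_counts), where gate_counts[k] is the number of
    controlled^k-NOT gates applied (k = 0 is a plain NOT).
    """
    bits = [(b >> i) & 1 for i in range(K + 1)]  # bits[K] starts at 0
    counts = [0] * (K + 1)
    for j in range(K):
        if not (a >> j) & 1:
            continue  # classical bit is 0: do nothing
        # Increment the number bits[K..j], most significant target first,
        # so that every control bit is still unmodified when it is read.
        for t in range(K, j, -1):
            counts[t - j] += 1
            if all(bits[j:t]):
                bits[t] ^= 1
        counts[0] += 1
        bits[j] ^= 1
    return sum(bit << i for i, bit in enumerate(bits)), counts


total, counts = add_in_place(9, 15, 4)
print(total, counts, pulses(counts))
```

For K = 4 and a = 1111 the gate vector is [4, 4, 3, 2, 1] and the pulse count is 74, in agreement with $\tfrac{1}{6}K(2K^2 + 15K + 19)$.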



We will need to add mod N. But if we can add, we can compare. We can do the comparison of N − a and b by adding ($2^K$ − N + a) to b; the final carry bit will be 1 only for a + b ≥ N. Thus, we can use the overwriting addition operation ADD in place of LT to fix the value of the select bit, and then use a multiplexed version of ADD to complete the mod N addition. Following this strategy, we construct an overwriting mod N adder that uses just one qubit of scratch space according to

$$\begin{aligned}
\mathrm{OADDN}'(a,N)_{\beta_K,\beta} \equiv\ & \mathrm{ADD}(a)_{\beta_K,\beta} \cdot \mathrm{MADD}'(N - a, 2^K - a)_{[\beta_K],\beta} \cdot \mathrm{ADD}(2^K - N + a)_{\beta_K,\beta} : \\
& |0\rangle_{\beta_K} |b\rangle_\beta \longmapsto |0\rangle_{\beta_K} |b + a\ (\mathrm{mod}\ N)\rangle_\beta .
\end{aligned} \tag{6.39}$$

Here each ADD operation computes a (K + 1)-bit sum as above, placing the final carry bit in the qubit $|\cdot\rangle_{\beta_K}$; MADD$'$, however, computes a K-bit sum. It is a multiplexed adder that adds N − a if the select bit $|\cdot\rangle_{\beta_K}$ reads 0, and adds $2^K$ − a if the select bit reads 1. The construction of MADD$'$ follows the spirit of the construction of MADD described in Sec. V B. In the average case, the number of laser pulses required to implement this OADDN$'$ operation is

$$[\mathrm{OADDN}'_{[\ell]}]^{\rm ave\ pulses}_{1\ \rm scratch} = \tfrac{7}{12}K^3 + \left( \tfrac{7}{4}\ell + \tfrac{33}{8} \right) K^2 + \left( \tfrac{15}{4}\ell + \tfrac{169}{24} \right) K . \tag{6.40}$$



The construction of the modular exponentiation operator EXPN from this OADDN$'$ operator follows the construction described in Sec. V. Thus, using the expression for [EXPN] in terms of [OADDN$_{[2]}$] implicit in Eq. (6.3), we find that with K + 1 qubits of scratch space, the EXPN function can be computed, in the average case, with a number of laser pulses given by

$$[\mathrm{EXPN}]^{\rm ave\ pulses}_{K+1} = (L-1)\cdot\left( \tfrac{7}{6}K^4 + \tfrac{169}{12}K^3 + \tfrac{83}{6}K^2 - \tfrac{97}{12}K \right) + \tfrac{5}{2}K + 7 . \tag{6.41}$$

For small values of K (K < 7), fewer pulses are required than for the algorithms described in Sec. VI A and VI B.
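The crossover at K = 7 can be checked by evaluating the pulse polynomials directly. The sketch below compares the minimal-space count with the fastest of the earlier networks (the enhanced machine with 2K + 2 scratch qubits, whose average pulse count per factor of (L − 1) is $186K^2 - 238K + 99$); exact rational arithmetic is used to avoid rounding, and the function names are ours.

```python
from fractions import Fraction as F


def pulses_minimal_space(K, L):
    """Average pulse count with K + 1 scratch qubits (scratch-free adder)."""
    per_factor = (F(7, 6) * K**4 + F(169, 12) * K**3
                  + F(83, 6) * K**2 - F(97, 12) * K)
    return (L - 1) * per_factor + F(5, 2) * K + 7


def pulses_enhanced_2k_plus_2(K, L):
    """Average pulse count on the enhanced machine with 2K + 2 scratch qubits."""
    return (L - 1) * (186 * K**2 - 238 * K + 99) + F(5, 2) * K + 7


for K in range(1, 9):
    L = 2 * K
    print(K, pulses_minimal_space(K, L), pulses_enhanced_2k_plus_2(K, L))
```

The quartic growth of the minimal-space network overtakes the quadratic networks at K = 7, so the space saving is only free for very small inputs.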



VII. N = 15 



As we noted in Sec. II F, Shor's factorization algorithm fails if N is even or a prime power ($N = p^\alpha$, p prime). Thus, the smallest composite integer N that can be successfully factored by Shor's method is N = 15. Though factoring 15 is not very hard, it is amusing to consider the computational resources that would be needed to solve this simplest of quantum factoring problems on, say, a linear ion trap.

Appealing to Eq. (6.11), with K = 4 and L = 2K = 8, our "average case" estimate of the number of laser pulses required on a machine with altogether K + L + (2K + 1) = 21 qubits of storage is 15,284. With 22 qubits of storage, our estimate improves to 14,878 pulses. With another three qubits (25 total), we can use the technique described in Sec. VI C to achieve a further improvement in speed.

Several observations allow us to reduce these resources substantially further. First of all, we notice that, for any positive integer x with x < 15 and gcd(x, 15) = 1 (i.e., for x = 1, 2, 4, 7, 8, 11, 13, 14), we have $x^4 \equiv 1$ (mod 15). Therefore,

$$x^a = x^{2a_1} \cdot x^{a_0} \quad (\mathrm{mod}\ 15) ; \tag{7.1}$$

only the last two bits of a are relevant in the computation of $x^a$. Hence, we might as well choose L = 2 instead of L = 8, which reduces the number of elementary operations required by a factor of about 7. (Even if the value of L used in the evaluation of the discrete Fourier transform is greater than 2, there is still no point in using L > 2 in the evaluation of the modular exponential function.)
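The claim that only the last two bits of the exponent matter is easy to verify by brute force. A short check, with illustrative names, using Python's built-in three-argument `pow`:

```python
from math import gcd

N = 15
# All x < 15 that are coprime to 15.
units = [x for x in range(1, N) if gcd(x, N) == 1]


def last_two_bits_suffice(x, a):
    """Check x^a == x^(2*a1) * x^(a0) (mod 15), with a1, a0 the low bits of a."""
    a1, a0 = (a >> 1) & 1, a & 1
    return pow(x, a, N) == (pow(x, 2 * a1, N) * pow(x, a0, N)) % N


print(units)
print(all(pow(x, 4, N) == 1 for x in units))
print(all(last_two_bits_suffice(x, a) for x in units for a in range(256)))
```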

Second, we can save on storage space (and improve speed) by noting that the overwriting addition routine described in Sec. VI D is reasonably efficient for small values of K. For K = 4 and L = 2, we need 11 qubits of storage and an estimated 1406 laser pulses.

For N = 15, the above is the most efficient algorithm we know that actually computes $x^a$ on the quantum computer. We can do still better if we are willing to allow the classical computer to perform the calculation of $x^a$. Obviously, this strategy will fail dismally for large values of K, since the classical calculation will require exponential time. Still, if our goal is merely to construct the entangled state

$$\frac{1}{2^{L/2}} \sum_{a=0}^{2^L - 1} |a\rangle^*_\alpha\, |x^a\ (\mathrm{mod}\ N)\rangle_\beta , \tag{7.2}$$

while using our quantum computational resources as sparingly as possible, then classical computation of $x^a$ is the most efficient procedure for small K.
So we imagine that x < 15 with gcd(x, 15) = 1 is randomly chosen, and that the classical computer generates a "lookup table" by computing the four-bit number $x^a$ (mod 15) for a = 0, 1, 2, 3. The classical computer then instructs the quantum computer to execute a sequence of operations that prepares the state Eq. (7.2). These operations require no scratch space at all, so only L + K = 6 qubits of storage are needed to prepare the entangled state.

The "worst case" (most complex lookup table) is x = 7 or x = 13. The lookup table for x = 7 is:

       a            7^a (mod 15)
    a_1  a_0      b_3  b_2  b_1  b_0
     0    0        0    0    0    1
     0    1        0    1    1    1
     1    0        0    1    0    0
     1    1        1    1    0    1
                                           (7.3)

An operator

$$\mathrm{EXPN}(x = 7, N = 15)_{\alpha,\beta} : |a\rangle^*_\alpha |0\rangle_\beta \longmapsto |a\rangle^*_\alpha\, |7^a\ (\mathrm{mod}\ 15)\rangle_\beta \tag{7.4}$$

that recreates this table can be constructed as

(7.5) 

The two NOT's at the beginning generate a "table" that is all 1's in the $\beta_0$ and $\beta_2$ columns, and all 0's in the $\beta_1$ and $\beta_3$ columns. The remaining operations fix the one incorrect entry in each row of the table. Thus, we have constructed an EXPN operator with complexity

$$[\mathrm{EXPN}(7,15)] = [6,0,4] ; \tag{7.6}$$

it can be implemented with 34 laser pulses on the Cirac-Zoller device. Since two additional pulses suffice to prepare the input register in the superposition state

$$\frac{1}{2} \sum_{a=0}^{3} |a\rangle^*_\alpha \tag{7.7}$$

before EXPN acts, we need 36 laser pulses to prepare the entangled state Eq. (7.2).

The EXPN operator constructed in Eq. (7.5) acts trivially on the input q-number a. Of course, this feature is not necessary; as long as the output state has the right correlations between the $|\cdot\rangle^*_\alpha$ and $|\cdot\rangle_\beta$ registers, we will successfully prepare the entangled state Eq. (7.2). By exploiting this observation, we can achieve another modest improvement in the complexity of EXPN; we see that

$$C_{[\alpha_1,\alpha_0],\beta_3}\, C_{\alpha_0}\, C_{[\alpha_1,\alpha_0],\beta_0}\, C_{\alpha_1}\, C_{[\alpha_1,\alpha_0],\beta_2}\, C_{\alpha_0}\, C_{[\alpha_1,\alpha_0],\beta_1}\, C_{\beta_2}\, C_{\beta_0} \tag{7.8}$$

applied to the input Eq. (7.7) also produces the output Eq. (7.2), even though it flips the value of $a_1$. Compared to Eq. (7.5), we do without the final NOT gate, and hence save one laser pulse. We can do better still by invoking the "custom gates" described in Appendix A; another implementation of the EXPN operator is

$$\mathrm{EXPN}(x,N)_{\alpha,\beta} = C_{[\bar\alpha_1,\alpha_0],\beta_1}\, C_{[\bar\alpha_1,\bar\alpha_0],\beta_2}\, C_{[\alpha_1,\bar\alpha_0],\beta_0}\, C_{[\alpha_1,\alpha_0],\beta_3}\, C_{\beta_2}\, C_{\beta_0} . \tag{7.9}$$

Here, $C_{[\bar\alpha_1,\bar\alpha_0],\beta_2}$, for example, is a gate that flips the value of qubit $\beta_2$ if and only if both qubit $\alpha_1$ and qubit $\alpha_0$ have the value zero rather than one (see Appendix A). Each custom gate in Eq. (7.9) can be implemented with 7 laser pulses. Hence, compared to Eq. (7.5) we save 4 pulses, and the state Eq. (7.2) can be prepared with just 32 pulses.
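Because every gate here permutes computational basis states, the three lookup-table networks can be checked with a purely classical simulation. The sketch below makes several assumptions that should be read as ours: `RESTORING` is one hypothetical gate ordering that matches the description "two NOT's, then fix one entry per row" and the count [6, 0, 4] (it is not claimed to be the exact network of Eq. (7.5)); `FLIPPING` is our reading of Eq. (7.8), applied right to left; `CUSTOM` follows the description of the custom-gate network, with `"~q"` denoting a control that fires when qubit q is 0.

```python
def run(network, a):
    """Apply a reversible network (gates in time order) to |a>|0000>.

    Each gate is a tuple (control, ..., target); a control written "~q"
    fires when qubit q is 0. Returns (final a, final b) as integers.
    """
    q = {"a0": a & 1, "a1": (a >> 1) & 1, "b0": 0, "b1": 0, "b2": 0, "b3": 0}
    for *controls, target in network:
        if all(q[c[1:]] == 0 if c.startswith("~") else q[c] == 1 for c in controls):
            q[target] ^= 1
    return q["a0"] + 2 * q["a1"], sum(q[f"b{i}"] << i for i in range(4))


def pulses(network):
    """1 pulse per NOT, 2k + 3 pulses per gate with k controls."""
    return sum(1 if len(g) == 1 else 2 * (len(g) - 1) + 3 for g in network)


# Hypothetical ordering with 6 NOTs and 4 controlled^2-NOTs; restores a.
RESTORING = [("b0",), ("b2",), ("a1", "a0", "b3"), ("a0",), ("a1", "a0", "b0"),
             ("a1",), ("a1", "a0", "b2"), ("a0",), ("a1", "a0", "b1"), ("a1",)]
# Our reading of Eq. (7.8), right to left; leaves a_1 flipped.
FLIPPING = [("b0",), ("b2",), ("a1", "a0", "b1"), ("a0",), ("a1", "a0", "b2"),
            ("a1",), ("a1", "a0", "b0"), ("a0",), ("a1", "a0", "b3")]
# Custom-gate version: negated controls replace the NOTs on the input register.
CUSTOM = [("b0",), ("b2",), ("a1", "a0", "b3"), ("a1", "~a0", "b0"),
          ("~a1", "~a0", "b2"), ("~a1", "a0", "b1")]

for name, net in [("restoring", RESTORING), ("flipping", FLIPPING), ("custom", CUSTOM)]:
    print(name, [run(net, a) for a in range(4)], pulses(net))
```

In all three cases the output register holds $7^{a'}$ (mod 15), where $a'$ is the final content of the input register, which is the only correlation the entangled state Eq. (7.2) requires.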

To complete the task of "factoring 15," it only remains to perform the Fourier transform on the input register and read it out. The measured value, the result of our quantum computation, will be a nonnegative integer $y < 2^L$ satisfying

$$y = \frac{2^L}{r} \cdot (\text{integer}) , \tag{7.10}$$

where r is the order of x mod N (r = 4 in the case N = 15 and x = 7), and the integer takes a random value ranging from 0 to r − 1. (Here the probability distribution for y is actually perfectly peaked at the values in Eq. (7.10), because r divides $2^L$.) Thus, if we perform the Fourier transform with L = 2, the result for y is a completely random number ranging over y = 0, 1, 2, 3. (Even so, by reducing y/4 to lowest terms, we succeed in recovering the correct value of r with probability 1/2.)

It is a bit disappointing to go to all the trouble to prepare the state Eq. (7.2) only to read out a random number in the end. If we wish, we can increase the number of qubits L of the input register (though the EXPN operator will still act only on the last two qubits). Then the outcome of the calculation will be a random multiple of $2^{L-2}$. But the probability of recovering the correct value of r is still 1/2.

Once we have found r = 4, a classical computer calculates $7^{4/2} \pm 1 = 3, 5$ (mod N), which are, in fact, the factors of N = 15. Since the L = 2 Fourier transform can be performed using L(2L − 1) = 6 laser pulses on the ion trap, we can "factor 15" with 38 pulses (not counting the final reading out of the device). For values of x other than 7 and 13, the number of pulses required is even smaller.
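The classical post-processing of the readout can be sketched as follows. Two points are assumptions on our part rather than steps spelled out above: the candidate order is taken as the denominator of $y/2^L$ in lowest terms (capped below N), and the factors are extracted with a greatest common divisor, which for N = 15, x = 7 reduces to the values $7^2 \pm 1$ quoted in the text. The function names are ours.

```python
from fractions import Fraction
from math import gcd


def order_candidate(y, L, N):
    """Denominator of y / 2^L in lowest terms, capped below N."""
    return Fraction(y, 2 ** L).limit_denominator(N - 1).denominator


def factors_from_order(x, r, N):
    """Return nontrivial factors of N from the order r of x, or None on failure."""
    if r % 2 or pow(x, r, N) != 1:
        return None  # r is odd or is not a period of x^a mod N
    half = pow(x, r // 2, N)
    if half == N - 1:
        return None  # x^(r/2) = -1 (mod N) yields only trivial factors
    return sorted({gcd(half - 1, N), gcd(half + 1, N)})


results = [factors_from_order(7, order_candidate(y, 2, 15), 15) for y in range(4)]
print(results)
```

With L = 2 the readouts y = 1 and y = 3 give r = 4 and hence the factors 3 and 5, while y = 0 and y = 2 fail, which is the success probability of 1/2 mentioned above.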



VIII. TESTING THE FOURIER TRANSFORM 



In Shor's factorization algorithm, a periodic function (the modular exponential function) 
is computed, creating entanglement between the input register and the output register of 
our quantum computer. Then the Fourier transform is applied to the input register, and the 
input register is read. In Sec. VII, we noted that a simple demonstration of this procedure (factorization of 15) could be carried out on a linear ion trap, requiring only a modest number of laser pulses.
Here we point out an even simpler demonstration of the principle underlying Shor's algorithm. Consider the function

$$f_K(a) = a\ (\mathrm{mod}\ 2^K) . \tag{8.1}$$

Evaluation of this function is very easy, since it merely copies the last K bits of the argument a. A unitary operator $\mathrm{MOD}_{2^K}$ that acts according to

$$\mathrm{MOD}_{2^K} : |a\rangle^*_\alpha |0\rangle_\beta \longmapsto |a\rangle^*_\alpha\, |a\ (\mathrm{mod}\ 2^K)\rangle_\beta \tag{8.2}$$

can be constructed as

$$\mathrm{MOD}_{2^K} \equiv C_{[\alpha_{K-1}],\beta_{K-1}} \cdots C_{[\alpha_1],\beta_1}\, C_{[\alpha_0],\beta_0} \tag{8.3}$$

(where $|\cdot\rangle^*_\alpha$ is an L-qubit register and $|\cdot\rangle_\beta$ is a K-qubit register). These K controlled-NOT operations can be accomplished with 5K laser pulses in the ion trap. Including the L single-qubit rotations needed to prepare the input register, then, the entangled state

$$\frac{1}{2^{L/2}} \sum_{a=0}^{2^L - 1} |a\rangle^*_\alpha\, |a\ (\mathrm{mod}\ 2^K)\rangle_\beta \tag{8.4}$$

can be generated with 5K + L pulses.

Now we can Fourier transform the input register (L(2L − 1) pulses), and read it out. Since the period $2^K$ of $f_K$ divides $2^L$, the Fourier transform should be perfectly peaked about values of y that satisfy

$$y = 2^{L-K} \cdot (\text{integer}) . \tag{8.5}$$

Thus, $y_{L-K-1}, \ldots, y_1, y_0$ should be identically zero, while $y_{L-1}, \ldots, y_{L-K+1}, y_{L-K}$ take random values.

The very simplest demonstration of this type (L = 2, K = 1) requires only three ions. Since $f_1$ has period 2, the two-qubit input register, after Fourier transforming, should read $y_1$ = random, $y_0$ = 0. This demonstration can be performed with 13 laser pulses (not counting the final reading out), and should be feasible with current technology.
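The expected readout statistics of this demonstration can be computed directly from the state Eq. (8.4) with a small discrete Fourier sum. The sketch below is a brute-force calculation in pure Python, not a model of the ion trap; the sign convention of the Fourier phase and the function names are our assumptions, and the pulse count simply adds 5K + L for state preparation to L(2L − 1) for the transform.

```python
import cmath


def readout_distribution(L, K):
    """Probability of each outcome y after Fourier transforming the input register."""
    size, period = 2 ** L, 2 ** K
    probs = []
    for y in range(size):
        p = 0.0
        for c in range(period):  # c labels the value a mod 2^K in the output register
            amplitude = sum(cmath.exp(2j * cmath.pi * a * y / size)
                            for a in range(c, size, period)) / size
            p += abs(amplitude) ** 2
        probs.append(p)
    return probs


def demo_pulses(L, K):
    """Pulses for state preparation (5K + L) plus the Fourier transform L(2L - 1)."""
    return 5 * K + L + L * (2 * L - 1)


print([round(p, 6) for p in readout_distribution(2, 1)], demo_pulses(2, 1))
```

For L = 2, K = 1 only y = 0 and y = 2 occur, each with probability 1/2, so the low bit of y is always zero and the high bit is random; in general the nonzero outcomes are the multiples of $2^{L-K}$.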
APPENDIX A: CUSTOM GATES 



In the algorithms that we have described in this paper, we have used the controlled$^k$-NOT operator as our fundamental quantum gate. Of course, there is much arbitrariness in this choice. For example, instead of the operation $C_{[i_1,\ldots,i_k],j}$, which flips qubit j if and only if qubits $i_1, \ldots, i_k$ all take the value 1, we could employ a gate that flips qubit j if and only if $i_1 i_2 \ldots i_k$ is some other specified string of k bits. This generalized gate, like $C_{[i_1,\ldots,i_k],j}$ itself, can easily be implemented on, say, a linear ion trap. We remark here that using such "custom gates" can reduce the complexity of some algorithms (as measured by the total number of laser pulses required).

To see how these generalized gates can be constructed using the ion trap, we note first of all that if we apply an appropriately tuned $3\pi$ pulse (instead of a $\pi$ pulse) to the ith ion,



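then the operation $W^{(i)}_{\rm phon}$ defined in Eq. (3.1) is replaced by

$$\tilde W^{(i)}_{\rm phon} : \begin{cases} |g\rangle_i |0\rangle_{\rm CM} \longmapsto |g\rangle_i |0\rangle_{\rm CM} \\ |e\rangle_i |0\rangle_{\rm CM} \longmapsto i\,|g\rangle_i |1\rangle_{\rm CM} \end{cases} \tag{A1}$$

(whose nontrivial action differs by a sign from that of $W^{(i)}_{\rm phon}$). With $W_{\rm phon}$ and $\tilde W_{\rm phon}$, we can construct an alternative conditional phase gate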



W, 



(0 

phon 



• v {j) ■ w. 



that acts nontrivially only if e 
conditional phase gate becomes 

-i 



phon ' 

1 and 77 = 



\eh\rih — {-l^mrih (A2) 
0. With an appropriate change of basis, this 



C, 



jjii) . y(hi) . — fj(j) 



-1 



• w. 



phon 

e) i \r j ®e 



■ V® ■ ML ■ U U) : 



phon 



1), 



(A3) 



a modified controlled-NOT gate that flips the target qubit if and only if the control qubit reads zero (compare Eq. (3.7)). Like the controlled-NOT gate, then, $C_{[\bar i],j}$ can be implemented with 5 laser pulses. Following the discussion in Sec. III C, it is straightforward to construct a modified controlled$^k$-NOT gate with a specified "custom" control string, for any k ≥ 1.

As a simple illustration of how a reduction in complexity can be achieved by using custom gates, consider the full adder FA(a) defined by Eqs. (5.4)-(5.6) and shown in Fig. 2. We can replace FA(1) by the alternative implementation

$$\mathrm{FA}'(a = 1)_{1,2,3} \equiv C_{[\bar 1],2}\, C_{[\bar 1,2],3}\, C_{[1],3} \tag{A4}$$

(where the $\bar i$ indicates that qubit i must have the value 0 (not 1) for the gate to act nontrivially). This saves one NOT gate, and hence one laser pulse, compared to the implementation in Eq. (5.6). Another example of the use of custom gates is described in Sec. VII.



Alternatively, we can implement $\tilde W_{\rm phon}$ with a $\pi$ pulse if the laser phase is appropriately adjusted.
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FIGURES 
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FIG. 1. The controlled fc -NOT gate. Input values of the qubits are shown on the right and 
output values on the left. This gate flips the value of the target qubit if all k control qubits take 
the value 1; otherwise, the gate acts trivially. 
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FIG. 2. The full adder FA{a). The order of the gates (here and in all of the following figures) 
is to be read from right to left. The gate array shown in (a) adds the classical bit a = 0; the second 
qubit carries the output sum bit and the third qubit carries the output carry bit. The gate array 
shown in (b) adds the classical bit a = 1. 
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FIG. 3. The multiplexed full adder MUXFA'(ao, a\). Here £ is the select bit that determines 
whether ao or ai is chosen as the classical addend. In (a), the case ao = 0, ai = 1 is shown; the 
gate array adds the qubit £-which is the same as ao for £ = and a\ for £ = 1. In (b), the case 
oq = 1, ai = is shown; the array adds 
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FIG. 4. The multiplexed full adder MUXFA(ao,a\) has a select bit £ and an enable string C. 
If all the bits of C take the value 1, then MUXFA acts in the same way as MUXFA' defined in 
Fig. 3. Otherwise, the classical addend is chosen to be 0. 
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FIG. 5. The multiplexed full adder MUXFA"(ao, a±) (shown here for ao = 0,a± = 1) is a 
modification of MU XFA that uses an extra bit of scratch space. The first gate stores C A £ in 
the extra scratch qubit, and subsequent gates use this scratch bit as a control bit. The last gate 
clears the scratch bit. The advantage of MUXFA" is that the longest control string required by 
any gate is shorter by one bit than the longest control string required in MU XFA. 
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FIG. 6. The multiplexed full adder MUXFA'" (a , a±) (shown here for a = 0, a\ = 1) uses 
simpler gates than those required by MUXFA, but unlike MUXFA" , it does not need an extra 
scratch bit. 
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FIG. 7. The multiplexed half adder MUXHA is simpler than MUXFA because it does not 
compute the output carry bit. 
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FIG. 8. The multiplexed K-h\t adder MADD(a, a') is constructed by chaining together K — 1 
MUXFA operations and one MUXHA operation. MADD adds a if-bit c-number to an input 
X-bit (/-number and obtains an output i^-bit q- number (the final carry bit is not computed). If 
MADD is enabled, the classical addend is a when the select bit has the value i = or is a' when 
i = 1. (When MADD is not enabled, the classical addend is 0.) 
FIG. 9. The mod N addition operator ADDN(a, N) computes a + b (mod N), where a is a
K-bit c-number and b is a K-bit q-number. When ADDN is enabled, the comparison operator
XLT(N - a) flips the value of the select bit to ℓ = 1 if a + b < N; then the multiplexed adder
MADD(2^K + a - N, a) chooses the c-number addend to be a for ℓ = 1 and 2^K + a - N for ℓ = 0.
XLT uses and then clears K bits of scratch space before MADD writes the mod N sum there.
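
The choice of addend in ADDN can be checked with a short classical sketch. It assumes 0 <= a < N < 2^K and 0 <= b < N, models the comparison as b < N - a, and models the dropped final carry of MADD as reduction modulo 2^K; the function name is illustrative.

```python
def addn(a, N, K, b):
    """Classical model of mod-N addition via a comparison and a K-bit add."""
    assert 0 <= a < N < 2**K and 0 <= b < N
    # Comparison step: select = 1 exactly when a + b < N.
    select = 1 if b < N - a else 0
    # Addend is a when no reduction is needed, otherwise 2^K + a - N.
    addend = a if select else 2**K + a - N
    # The K-bit adder drops the final carry, i.e. works modulo 2^K.
    return (b + addend) % 2**K
```

When a + b >= N, adding 2^K + a - N and discarding the carry yields a + b - N, so both branches return a + b (mod N).
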
FIG. 10. The overwriting mod N addition operator OADDN(a, N) (when enabled) adds the
c-number a to the q-number b, and then erases b. The "swapping of the leads" is a classical
operation, not a quantum gate. OADDN uses and then clears K + 1 bits of scratch space; this
scratch space is suppressed on the left side of the figure.
FIG. 11. The mod N multiplication operator MULN(a, N) (when enabled) computes
a · b (mod N), where a is a c-number and b is a q-number; it is constructed by chaining together
K - 1 OADDN operators and one EMUL operator.
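
A classical sketch of multiplication by repeated conditional mod N addition is given below. It assumes that the EMUL step amounts to writing a · b_0 into an empty register and that the i-th OADDN adds a · 2^i (mod N) when bit b_i is set; this reading of the chain and the function name are assumptions of the example, not taken from the figure.

```python
def muln(a, N, K, b):
    """Classical model of mod-N multiplication by a chain of K steps."""
    # First step (standing in for EMUL): accumulator = a * b_0.
    acc = (a % N) if (b & 1) else 0
    # Remaining K - 1 steps: add a * 2^i (mod N), enabled by bit b_i.
    for i in range(1, K):
        if (b >> i) & 1:
            acc = (acc + (a << i) % N) % N
    return acc
```

Each addend a · 2^i (mod N) is a classical number that can be computed in advance.
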
FIG. 12. The overwriting mod N multiplication operator OMULN(a, N) (when enabled)
computes a · b (mod N) and then erases the q-number b. The XOR gates at the end (when enabled)
swap the contents of the two registers. OMULN uses and then clears 2K + 1 qubits of scratch
space, of which only K bits are indicated in the figure.
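
One way to see how b can be erased reversibly is sketched below. It assumes that a is invertible mod N and that b < N, and it erases b by XORing in a^(-1) times the product, which is a standard uncomputation trick assumed here rather than read off the figure. The three-argument `pow` with exponent -1 requires Python 3.8 or later.

```python
def omuln(a, N, b):
    """Classical model of overwriting mod-N multiplication.

    Returns (input_register, scratch_register) after the operation;
    the input register holds a*b mod N and the scratch register is 0.
    """
    a_inv = pow(a, -1, N)           # raises ValueError if gcd(a, N) != 1
    reg_b, reg_out = b, 0
    reg_out ^= (a * reg_b) % N      # scratch now holds a*b mod N
    reg_b ^= (a_inv * reg_out) % N  # b XOR b = 0: the input is erased
    # Swap the two registers with XOR operations.
    reg_b ^= reg_out
    reg_out ^= reg_b
    reg_b ^= reg_out
    return reg_b, reg_out
```
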
FIG. 13. The modified mod N multiplication routine MULN'(a, N) uses simpler elementary
gates than those used by MULN, but MULN' requires an extra bit of scratch space. Instead of 
calling the OADDN routine with two enable bits, MULN' first stores the AND of the two enable 
bits in the extra scratch bit. Then OADDN with one enable bit can be called instead, where the 
scratch bit is the enable bit. 
FIG. 14. The mod N exponentiation operator EXPN(x, N) computes x^a (mod N), where x is
a K-bit c-number and a is an L-bit q-number. It is constructed by chaining together L OMULN
operators and a NOT. The 2K + 1 qubits of scratch space used by EXPN are suppressed in the
figure. The first OMULN in the chain can be replaced by a simpler operation, as is discussed in
the text.
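
The structure of EXPN corresponds to binary exponentiation, which can be sketched classically as follows. The sketch assumes that the NOT initializes the output register to 1 and that the i-th OMULN multiplies by x^(2^i) (mod N) when bit a_i is set; both points are our reading of the caption, and the function name is illustrative.

```python
def expn(x, N, L, a):
    """Classical model of mod-N exponentiation with an L-bit exponent."""
    result = 1      # register initialized to 1 (the role assumed for the NOT)
    factor = x % N  # x^(2^0) mod N
    for i in range(L):
        if (a >> i) & 1:                # multiplication enabled by bit a_i
            result = (result * factor) % N
        factor = (factor * factor) % N  # next classical factor x^(2^(i+1))
    return result
```

The factors x^(2^i) (mod N) are c-numbers, so the repeated squaring is done classically and only the L conditional multiplications act on the quantum register.
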